Page 1: WOZ acoustic data collection for interactive TV

WOZ acoustic data collection for interactive TV

A. Brutti*, L. Cristoforetti*, W. Kellermann+, L. Marquardt+, M. Omologo*

* Fondazione Bruno Kessler (FBK) - irst

Via Sommarive 18, 38050 Povo (TN), ITALY

+ Multimedia Communications and Signal Processing, University of Erlangen-Nuremberg (FAU)

Cauerstr. 7, 91058 Erlangen, GERMANY

LREC 2008 – Marrakech, 28-30/05/08

Page 2: WOZ acoustic data collection for interactive TV

Luca Cristoforetti, LREC 2008 – Marrakech, 28-30/05/08

When did she speak?

What does she say?

What is the output from each loudspeaker? How is it at each microphone?

Who is she?

Robustness in a real reverberant environment

Noise sources?

Where is she? What is her head orientation? Other speakers?

The DICIT EU project: Distant-talking Interfaces for Control of Interactive TV

Page 3: WOZ acoustic data collection for interactive TV

The DICIT Project

STREP Project – FP6

Strategic objective: 2.5.7 – Multimodal Interfaces

Duration: October 2006 – September 2009

Page 4: WOZ acoustic data collection for interactive TV

What is a Wizard of Oz (WOZ) experiment?

A subject is requested to complete specific tasks using an artificial system

The user is told that the system is fully functional and is asked to use it in an intuitive way

The system is operated by a person (wizard) not visible to the subject

The wizard can react in a more comprehensive way and can create particular situations

Page 5: WOZ acoustic data collection for interactive TV

Why a WOZ data collection?

We needed to collect an acoustic database for testing pre-processing algorithms:

– acoustic scene analysis
– speaker ID and verification
– echo cancellation
– blind source separation
– beamforming
– speaker localization and tracking
– distant automatic speech recognition

With a WOZ, realistic scenarios can be simulated at a preliminary stage, allowing for repeatable experiments

There is no need for a fully working system in order to collect real data

Naïve users do not behave like expert users: they use the system in a realistic way

Page 6: WOZ acoustic data collection for interactive TV

The DICIT WOZ

Experiments were conducted in the laboratories at FBK and FAU

A room was used as a living room with TV, loudspeakers and seats

An adjacent room, not visible to the users, was used by the wizard and the simulation system

Users watched the TV and had to interact with it by voice and remote control, to change channels and to retrieve information from the teletext pages

At some point, they had to move around and speak with the system

Page 7: WOZ acoustic data collection for interactive TV

Strategy for the recordings

Four users sat in the room, but one of them was the co-wizard, who ensured the regularity of the experiment and produced some acoustic events

Users were recorded by close-talk and distant microphones

Interactions were recorded by 3 fixed cameras that allow automatic tracking of the users' movements

Recordings were made with Italian, German and English groups

Page 8: WOZ acoustic data collection for interactive TV

The FBK experimental room

A harmonic array of 15 electret microphones was purpose-built and placed above the TV

Page 9: WOZ acoustic data collection for interactive TV

Clip from a recorded session

Page 10: WOZ acoustic data collection for interactive TV

WOZ preparation

12 video clips and 100 teletext pages were recorded from real TV; everything was available in 3 languages

Stereo audio channels were extracted and decorrelated (by FAU) for the echo canceller, and clips were recreated to fit the simulation

The system was controlled by a PC running the Elektrobit EB GUIDE Studio simulator tool

A remote control infrared receiver was integrated into the system, so that users could operate the TV with a real remote control

Page 11: WOZ acoustic data collection for interactive TV

Recorded sessions

FBK and FAU recorded different sessions using a similar setup, in different languages

Each user interaction lasted about 10 minutes, for a total of 360 minutes of recordings

24 or 26 synchronous channels were recorded at 48 kHz with 16-bit precision, plus 64 channels from the MarkIII array at 44.1 kHz and 24 bits

Site   Language   Number of sessions
FBK    Italian    6
FAU    German     5
FAU    English    1
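
The totals quoted on this slide can be cross-checked against the session counts. The sketch below assumes that every session had three naïve users (four people in the room, one of them being the co-wizard, as described under the recording strategy) and takes "about 10 minutes" as exactly 10; the constant names are illustrative only.

```python
# Sessions per (site, language), as listed in the table above.
sessions = {
    ("FBK", "Italian"): 6,
    ("FAU", "German"): 5,
    ("FAU", "English"): 1,
}

# Assumption: four people per session, one of whom is the co-wizard.
NAIVE_USERS_PER_SESSION = 3
# Assumption: "about 10 minutes" per user interaction, taken as exactly 10.
MINUTES_PER_USER = 10

total_sessions = sum(sessions.values())
total_users = total_sessions * NAIVE_USERS_PER_SESSION
total_minutes = total_users * MINUTES_PER_USER

print(total_sessions, total_users, total_minutes)  # 12 36 360
```

Under these assumptions the result agrees with the 36 naïve users and 360 minutes reported in the conclusions.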

Page 12: WOZ acoustic data collection for interactive TV

Data annotation

The 6 Italian sessions have been manually transcribed and segmented at word level, using Transcriber

An automatic segmentation was obtained with a tool based on energy of the close-talk signals, then adjusted when necessary

A stereo file was created, with two channels for close-talk and environment sounds to ease the annotation process

Annotation comprises the speaker ID, the transcription of the uttered sentence, and any noise included in the acoustic event list

Specific labels for acoustic events have been introduced, following a defined guideline

Video data has been used to derive 3D coordinates for the head of the speaker and reference files were created with a frame rate of 5 labels per second
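
The energy-based automatic segmentation of the close-talk signals mentioned above could look roughly like the following. This is a minimal sketch, not the tool actually used in the project: the frame length, the threshold, and the function names are all assumptions chosen for illustration.

```python
import math

def frame_energies_db(samples, frame_len):
    """Log energy (dB) of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Small floor avoids log10(0) on digital silence.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies

def segment_by_energy(samples, sample_rate, frame_ms=20, threshold_db=-40.0):
    """Return (start_s, end_s) intervals where frame energy exceeds the threshold.

    frame_ms and threshold_db are hypothetical defaults, not project values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    segments = []
    seg_start = None
    energies = frame_energies_db(samples, frame_len)
    for i, energy in enumerate(energies):
        active = energy > threshold_db
        if active and seg_start is None:
            seg_start = i
        elif not active and seg_start is not None:
            segments.append((seg_start * frame_len / sample_rate,
                             i * frame_len / sample_rate))
            seg_start = None
    if seg_start is not None:
        # Close a segment that is still open at the end of the signal.
        segments.append((seg_start * frame_len / sample_rate,
                         len(energies) * frame_len / sample_rate))
    return segments

# Synthetic check: 0.5 s of silence, 1 s of a 440 Hz tone, 0.5 s of silence.
rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
signal = [0.0] * (rate // 2) + tone + [0.0] * (rate // 2)
print(segment_by_energy(signal, rate))  # [(0.5, 1.5)]
```

A fixed threshold only works on a clean close-talk channel; boundaries obtained this way would still need the manual adjustment mentioned in the slide.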

Label   Acoustic event
[sla]   door slamming
[cha]   chair moving
[pho]   phone ringing (various rings)
[cou]   coughing
[lau]   laughing
[fal]   objects falling (water bottle, book)
[pap]   paper rustling (newspaper, magazine)
[spk]   noises from speaker mouth
[unk]   other unknown noises
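
The label set above maps naturally onto a lookup table. The sketch below assumes the labels appear inline in a transcription line as bracketed three-letter tags; the actual Transcriber file layout is not shown in the slides, so the input format, the example sentence, and the function names are hypothetical.

```python
import re

# Acoustic event labels exactly as listed in the table above.
ACOUSTIC_EVENTS = {
    "sla": "door slamming",
    "cha": "chair moving",
    "pho": "phone ringing (various rings)",
    "cou": "coughing",
    "lau": "laughing",
    "fal": "objects falling (water bottle, book)",
    "pap": "paper rustling (newspaper, magazine)",
    "spk": "noises from speaker mouth",
    "unk": "other unknown noises",
}

# Assumption: tags are written inline as "[xyz]" with three lowercase letters.
TAG_PATTERN = re.compile(r"\[([a-z]{3})\]")

def extract_events(transcription):
    """Return (label, description) pairs for every tag found, in order."""
    events = []
    for label in TAG_PATTERN.findall(transcription):
        if label not in ACOUSTIC_EVENTS:
            raise ValueError(f"unknown acoustic event label: [{label}]")
        events.append((label, ACOUSTIC_EVENTS[label]))
    return events

def strip_events(transcription):
    """Return the spoken words only, with event tags removed."""
    return " ".join(TAG_PATTERN.sub(" ", transcription).split())

# Hypothetical annotated utterance, not taken from the corpus.
line = "[cou] switch to channel five [pap]"
print(extract_events(line))
print(strip_events(line))
```

Rejecting unknown tags rather than mapping them silently to [unk] makes guideline violations visible during annotation checks.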

Page 13: WOZ acoustic data collection for interactive TV

Data exploitation / testing

Data have been used for a preliminary evaluation of some FBK algorithms:

– localization techniques: precision is around 30 cm
– acoustic event classification: 682 + 108 audio segments were used, with 92% accuracy
– speaker verification and identification: the data were used for testing, but close-talk signals still perform better than the beamformed signal
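
The slides do not say how the localization precision was measured. One plausible reading, sketched here purely as an assumption, is the mean Euclidean distance between estimated source positions and the 3D head-position reference labels derived from video; the function name and the sample coordinates are hypothetical.

```python
import math

def mean_localization_error(estimates, references):
    """Mean Euclidean distance (metres) between paired 3D positions.

    Assumes both lists are time-aligned, e.g. one entry per reference label.
    """
    if not estimates or len(estimates) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    total = sum(math.dist(est, ref) for est, ref in zip(estimates, references))
    return total / len(estimates)

# Hypothetical positions in metres (x, y, z), not taken from the corpus.
references = [(1.0, 2.0, 1.5), (1.2, 2.0, 1.5)]
estimates = [(1.3, 2.4, 1.5), (1.2, 2.0, 1.5)]
print(mean_localization_error(estimates, references))
```

With reference labels at 5 per second, estimates produced at another rate would first have to be resampled or matched to the nearest label time.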

Room impulse response measurements have been carried out at both sites, in different positions. They are useful, e.g., for speech contamination purposes
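
Speech contamination with a measured room impulse response amounts to convolving clean (close-talk) speech with that response. The following is a minimal sketch with toy values: the impulse response below is invented for illustration, and a real one would have thousands of taps and call for an FFT-based convolution.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

# Toy data: a direct path plus one reflection two samples later (assumption).
clean = [1.0, 0.5, 0.0, -0.5]
room_ir = [1.0, 0.0, 0.5]

reverberant = convolve(clean, room_ir)
print(reverberant)  # [1.0, 0.5, 0.5, -0.25, 0.0, -0.25]
```

Recorded noise can then be added to the reverberant signal to approximate distant-talking conditions.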

Page 14: WOZ acoustic data collection for interactive TV

Transcriber session

Page 15: WOZ acoustic data collection for interactive TV

Conclusions

This collection of data has been the first of its kind and is of significant benefit to acoustic front-end algorithms and dialogue strategies

36 naïve users have been recorded, yielding 360 minutes of signals on 24 to 26 synchronously recorded channels (125 GB of data)

Users enjoyed the system and tolerated some recognition errors; they preferred the voice modality over remote control interaction

Page 16: WOZ acoustic data collection for interactive TV

Current status of the project

The project is in its second year

We have just finished integrating the first prototype

We are ready to start evaluating the prototype

More information and demo clips can be found at

http://dicit.fbk.eu

Page 17: WOZ acoustic data collection for interactive TV

Thank You!