Micai 13 contextualized practical speech

Practical Speech Recognition for Contextualized Service Robots

Departamento de Ciencias de la Computación

Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas

Universidad Nacional Autónoma de México

http://golem.iimas.unam.mx/

Ivan Meza, Caleb Rascón and Luis Pineda

Grupo Golem

Service robots

● Our future butlers

● They are task oriented

○ Clean up a room

○ Play a game

● Interaction with spoken language

● They work in noisy environments

● Microphone is not close to the speaker

● Poor speech recognition

Proposal

● Improve the system on four aspects

● Contextualized recogniser

● Prompting strategies

● Recovery strategies

● Audio calibration

I. Contextualized recognition

● Use specific language models for the given expectations

■ YES: yes, okay, all right

■ NO: no, don’t, do not

■ NAVIGATE: go to the kitchen, go to the living room, go to the bedroom
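A minimal sketch of the idea, assuming each expectation maps to a small phrase list. The phrases come from the examples above; the names `LANGUAGE_MODELS`, `phrases_for` and `interpret` are hypothetical and are not the robot's actual ASR interface.

```python
# Hypothetical phrase lists, one per expectation, taken from the slide examples.
LANGUAGE_MODELS = {
    "YES": ["yes", "okay", "all right"],
    "NO": ["no", "don't", "do not"],
    "NAVIGATE": [
        "go to the kitchen",
        "go to the living room",
        "go to the bedroom",
    ],
}


def phrases_for(expectations):
    """Collect the phrases the recogniser should accept in the current context."""
    phrases = []
    for name in expectations:
        phrases.extend(LANGUAGE_MODELS[name])
    return phrases


def interpret(hypothesis, expectations):
    """Map a recognised string to the expectation it satisfies, or None."""
    text = hypothesis.strip().lower()
    for name in expectations:
        if text in LANGUAGE_MODELS[name]:
            return name
    return None
```

With the expectations set to YES and NO, a navigation command is out of context and is not interpreted, which is the point of restricting the model to what the dialogue expects.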

ASR module

II. Prompting strategies

● Let the user know when to speak

■ Beep sound

● Speaker volume monitor

■ Could you speak louder or softer
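A volume monitor could be sketched as below. This is an illustration only: the slides do not say how the level is measured, so the RMS measure and both thresholds (`too_soft_db`, `too_loud_db`) are assumptions.

```python
import math


def level_dbfs(samples):
    """RMS level of float samples in [-1.0, 1.0], in dB relative to full scale."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def volume_prompt(samples, too_soft_db=-45.0, too_loud_db=-10.0):
    """Return a prompt when the level is outside the assumed range, else None."""
    level = level_dbfs(samples)
    if level < too_soft_db:
        return "Could you speak louder?"
    if level > too_loud_db:
        return "Could you speak softer?"
    return None
```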

III. Recovery strategy

● Let the user know when something went wrong

■ could you repeat?

■ i can’t hear you well, could you repeat

■ sorry, i’m a little deaf
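One way such a recovery loop might look, assuming the recogniser and the speech output are passed in as callables. `listen_with_recovery`, `recognise`, `say` and `max_attempts` are hypothetical names; only the prompts come from the slide.

```python
import random

# Recovery prompts listed on the slide.
RECOVERY_PROMPTS = [
    "could you repeat?",
    "i can't hear you well, could you repeat",
    "sorry, i'm a little deaf",
]


def listen_with_recovery(recognise, say, expected, max_attempts=3, rng=random):
    """Listen until an expected phrase is heard or the attempts run out."""
    for attempt in range(max_attempts):
        hypothesis = recognise()
        if hypothesis in expected:
            return hypothesis
        # Only ask again if another attempt is coming.
        if attempt < max_attempts - 1:
            say(rng.choice(RECOVERY_PROMPTS))
    return None
```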

IV. Calibration of audio settings

● Hardware

■ 1 directional microphone

■ 1 USB interface with 4 channels

■ 2 speakers

● Calibration of SNR in situ

■ Background noise: -58 dB

■ SNR set to 20 dB
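The arithmetic behind these two numbers can be sketched as follows, assuming SNR is the difference in dB between the speech level and the background noise level. The function names are hypothetical, and the slides do not describe the calibration procedure itself.

```python
def target_speech_level_db(noise_db, snr_db):
    """Speech level (dB) that sits snr_db above the measured noise floor."""
    return noise_db + snr_db


def measured_snr_db(speech_db, noise_db):
    """SNR in dB, given speech and noise levels that are already in dB."""
    return speech_db - noise_db


# With the values on the slide: -58 dB noise and a 20 dB SNR.
print(target_speech_level_db(-58.0, 20.0))  # -38.0
```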

Corpus evaluation

● Logs from the robot performing RoboCup tasks

■ 2 years of interactions in the lab and in competition

■ 1,439 utterances

■ 2,472 tokens

■ 120 types

■ 11 tasks

■ 9 of 11 tasks are contextualized

■ 14 language models

Contextualized recognition

We measure WER (the lower the better)

● With a unique LM for all tasks: 53.84%

● With task-based LM: 28.28%

● With contextualized LM: 23.42%

17.2% relative error reduction over the task-based LM
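For reference, a minimal sketch of how WER and the relative reduction are usually computed, assuming the standard word-level edit-distance definition. The slides do not show the evaluation scripts, so `wer` and `relative_reduction` are illustrative names.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


# Task-based LM (28.28%) versus contextualized LM (23.42%).
print(round(100 * relative_reduction(28.28, 23.42), 1))  # 17.2
```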

Beep sound

● 79 utterances were recorded without the beep sound

■ Without beeps 55.86%

■ With beeps 39.75%

■ With beeps full 53.72%

Relative error reduction: between 4% and 30%

Usage of SoundLoc System

● We measure usage

■ 174 times could have been triggered

■ 21 soft speech

■ 4 louder

14.36% of the times

Recovery strategy

● We measure usage

■ 504 times could have been triggered

■ 85 times activated

16.87% of the times

Conclusions

● Each strategy improves performance by a small amount

● Together they allow practical speech recognition on a service robot

Thank you

● Questions?
