Speech recognition in MUMIS Mirjam Wester, Judith Kessens & Helmer Strik

Page 1: Speech recognition in MUMIS

Speech recognition in MUMIS

Mirjam Wester, Judith Kessens & Helmer Strik

Page 2: Speech recognition in MUMIS

Intro

• Objective: Automatic speech recognition of football commentaries

• SPEX transcribed two matches for two languages (Dutch and English):
  – England - Germany (Eng-Dld)
  – Yugoslavia - The Netherlands (Yug-Ned)

• Commentaries and stadium noise are mixed

Page 3: Speech recognition in MUMIS

Data Conversion

• SPEX transcription:
  – text grid:
    • orthographic transcription
    • chunk alignment; chunk = a segment of speech of about 2 to 3 seconds
  – CD with one large wav file
    • Split according to chunk alignments
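The splitting step can be sketched as follows. This is a minimal illustration, not the tool used in the project: it assumes the chunk alignments have already been read from the text grid into (start, end) pairs in seconds, and the function name `split_wav` and the output file pattern are hypothetical.

```python
import wave


def split_wav(wav_path, chunks, out_pattern="chunk_{:04d}.wav"):
    """Write one wav file per (start_sec, end_sec) chunk; return the output paths.

    `chunks` is assumed to come from the chunk alignment in the transcription.
    """
    out_paths = []
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start, end) in enumerate(chunks):
            first = int(round(start * rate))
            n_frames = max(0, int(round(end * rate)) - first)
            src.setpos(first)                  # jump to the chunk start
            frames = src.readframes(n_frames)  # read only this chunk
            out_path = out_pattern.format(i)
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)          # same channels, width and rate
                dst.writeframes(frames)        # header frame count is fixed on close
            out_paths.append(out_path)
    return out_paths
```

Reading chunk by chunk with `setpos` avoids loading the whole match recording into memory, which matters when the CD holds one large file.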

Page 4: Speech recognition in MUMIS

Examples of data

• Yug-Ned Dutch

• Yug-Ned English

• Eng-Dld Dutch

• Eng-Dld English

Page 5: Speech recognition in MUMIS
Page 6: Speech recognition in MUMIS

Statistics

                 Dutch   English
#chunks           5146      5613
#speech chunks    3006      3725
#empty chunks     2140      1843
#words (types)    1954      2923
#words (tokens)  12079     24022

English matches have two commentators, Dutch only one. Overlapping segments have been disregarded.
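Counts of this kind can be derived from the orthographic transcriptions roughly as below. The sketch assumes one transcription string per chunk, whitespace tokenization and case folding; the actual counting conventions behind the table are not stated on the slide, and `count_words` is a hypothetical name.

```python
from collections import Counter


def count_words(transcriptions):
    """Return chunk and word statistics for a list of per-chunk transcriptions."""
    counts = Counter()
    n_speech = 0
    n_empty = 0
    for text in transcriptions:
        words = text.lower().split()  # assumption: whitespace tokens, case folded
        if words:
            n_speech += 1
            counts.update(words)
        else:
            n_empty += 1              # chunk without any transcribed speech
    return {
        "chunks": n_speech + n_empty,
        "speech_chunks": n_speech,
        "empty_chunks": n_empty,
        "types": len(counts),             # distinct words
        "tokens": sum(counts.values()),   # running words
    }
```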

Page 7: Speech recognition in MUMIS

Training

Dutch:
• Yug-Ned ¾ of CD (19 min speech)
• France Telecom Noise Reduction (FTNR)

English:
• Yug-Ned ¾ of CD (28 min speech)
• FTNR

For more information on the France Telecom Noise Reduction tool, see: B. Noé, J. Sienel, D. Jouvet, L. Mauuary, L. Boves, J. de Veth & F. de Wet, "Noise Reduction for Noise Robust Feature Extraction for Distributed Speech Recognition". In Proc. of Eurospeech '01.

Page 8: Speech recognition in MUMIS

Test

Dutch:
• Yug-Ned ¼ of CD
  – 626 chunks, 1577 words
  – lexicon and language model based on complete Yug-Ned match

English:
• Yug-Ned ¼ of CD
  – 636 chunks, 2641 words
  – lexicon and language model based on complete Yug-Ned match

Page 9: Speech recognition in MUMIS

SNR before and after FTNR tool
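The slides do not say how SNR was measured. One common estimate, given here only as a sketch under that caveat, compares the mean power of a speech chunk with that of a noise-only chunk (for example an empty chunk), assuming the stadium noise is additive and uncorrelated with the commentary. The names `mean_power` and `snr_db` are hypothetical.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech_chunk, noise_chunk):
    """Estimate SNR in dB from a speech+noise chunk and a noise-only chunk."""
    p_noise = mean_power(noise_chunk)
    # Assumption: noise is additive and uncorrelated, so powers subtract.
    p_speech = mean_power(speech_chunk) - p_noise
    if p_speech <= 0:
        return float("-inf")  # chunk is no louder than the noise floor
    if p_noise == 0:
        return float("inf")   # no measurable noise
    return 10.0 * math.log10(p_speech / p_noise)
```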

Page 10: Speech recognition in MUMIS

WER results for Yug-Ned before and after FTNR

[Figure: WER (%) by training material for the acoustic models: NL-original, NL-FTNR, Eng-Original, Eng-FTNR]
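For reference, word error rate is conventionally computed from the minimum number of substitutions, deletions and insertions needed to turn the recognizer output into the reference transcription, divided by the number of reference words. The sketch below implements that standard definition; whether the scoring used for these results applied any extra normalization is not stated, and the function names are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1], len(ref)


def wer(pairs):
    """WER in percent over (reference, hypothesis) pairs, pooled over all chunks."""
    errors = 0
    words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    if words == 0:
        raise ValueError("no reference words to score")
    return 100.0 * errors / words
```

Pooling errors and words over all chunks, rather than averaging per-chunk rates, keeps short chunks from dominating the result.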

Page 11: Speech recognition in MUMIS

Dutch – Polyphone

• Data consists of phonetically rich sentences

• Phone models were trained on:
  – Polyphone all speakers
  – Polyphone male speakers
  – Polyphone male speakers + MUMIS noise

• Polyphone as bootstrap for segmentation of MUMIS material
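Adding recorded noise to clean training speech is usually done by scaling the noise to a chosen SNR before mixing. The sketch below shows that general technique; how the MUMIS noise was actually combined with Polyphone is not described on the slides, so the target SNR parameter, the looping of the noise and the name `add_noise` are all assumptions.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def add_noise(speech, noise, target_snr_db):
    """Return speech plus noise, with the noise scaled to the requested SNR (dB)."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise so that it covers the whole utterance.
    looped = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech = mean_power(speech)
    p_noise = mean_power(looped)
    if p_noise == 0:
        raise ValueError("noise segment is silent")
    # Choose the gain so that p_speech / (gain^2 * p_noise) = 10^(SNR/10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (target_snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, looped)]
```

In practice the target SNR would be picked to match the range observed in the commentary recordings.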

Page 12: Speech recognition in MUMIS

Polyphone models (Dutch), Yug-Ned test set

[Figure: WER (%) by training material for the acoustic models: Poly-all, Poly-male, Poly-male+noise, Poly-seg.MUMIS]

Page 13: Speech recognition in MUMIS

Cross tests (Dutch & English)

Cross-tests:

• train on ¾ Yug-Ned, test on ¼ Eng-Dld

• train on ¾ Eng-Dld, test on ¼ Yug-Ned

Page 14: Speech recognition in MUMIS

MUMIS models (Dutch)

[Figure: WER (%) by training material for the acoustic models (Yug-Ned, Eng-Dld-cross, Eng-Dld, Yug-Ned-cross), grouped by Yug-Ned test and Eng-Dld test]

Page 15: Speech recognition in MUMIS

MUMIS models (English)

[Figure: WER (%) by training material for the acoustic models (Yug-Ned, Eng-Dld-cross, Eng-Dld, Yug-Ned-cross), grouped by Yug-Ned test and Eng-Dld test]

Page 16: Speech recognition in MUMIS

MUMIS models (Dutch+English)

[Figure: WER (%) by training material for the acoustic models (Yug-Ned, Eng-Dld-cross, Eng-Dld, Yug-Ned-cross), grouped by Yug-Ned test and Eng-Dld test; legend: NL, ENG]

Page 17: Speech recognition in MUMIS

Function words vs content words

[Figure: WER (%) by word type (function, content, names, all) for Yug-Ned and Eng-Dld, shown separately for the English data and the Dutch data]

Page 18: Speech recognition in MUMIS

SNR vs. WER (1): Dutch data

[Figure: WER (%) against SNR1 (dB) for YugNed, YugNed_ftnr and EngDld]

Page 19: Speech recognition in MUMIS

SNR vs. WER (2): English data

[Figure: WER (%) against SNR1 (dB) for YugNed, YugNed_ftnr and EngDld]

Page 20: Speech recognition in MUMIS

Discussion

• WERs are high
• Noise?
  – FTNR leads to lower SNR, but WERs do not improve substantially
• Not enough training data?
  – Polyphone for training/bootstrapping does not lead to lower WERs than training on MUMIS data
  – Noisifying Polyphone with MUMIS gives encouraging results

Page 21: Speech recognition in MUMIS

Discussion continued

• Function words comprise ± 50% of the data, and cause a great deal of the errors

• Names are recognized very well

• Function words not necessary for information extraction (?)

Page 22: Speech recognition in MUMIS

Future work

• Steps to noise robust speech recognition:
  – model/speaker adaptation
  – combinations of noisified Polyphone models and FTNR
• Other issues:
  – transcription of more data
    • English, Dutch and German
    • preference for specific games? radio? TV?
  – generic football specific language model
  – confidence measures?

Page 23: Speech recognition in MUMIS

Future work continued

Questions:

• What type of output from ASR is needed?

– word-graph

– n-best list

– top of the list

– word spotting? only content words?

• For research purposes: is it possible to obtain data that has not been mixed (noise + commentary)?