Speech recognition in MUMIS
Mirjam Wester, Judith Kessens & Helmer Strik
Intro
• Objective: Automatic speech recognition of football commentaries
• SPEX transcribed two matches for two languages (Dutch and English):
  – England - Germany (Eng-Dld)
  – Yugoslavia - The Netherlands (Yug-Ned)
• Commentaries and stadium noise are mixed
Data Conversion
• SPEX transcription:
  – text grid:
    • orthographic transcription
    • chunk alignment; chunk = a segment of speech of about 2 to 3 seconds
  – CD with one large wav file
    • Split according to chunk alignments
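The splitting step can be sketched as follows. This is a minimal illustration, not the tool used in the project: it assumes the chunk alignments have already been read from the text grid into a list of (start, end) times in seconds, and the function name `split_wav` and the output naming scheme are hypothetical.

```python
import wave


def split_wav(wav_path, chunks, out_prefix):
    """Write one wav file per (start_sec, end_sec) chunk and return the paths.

    `chunks` is assumed to come from the chunk alignment in the text grid.
    """
    paths = []
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start, end) in enumerate(chunks):
            first = int(round(start * rate))          # first frame of the chunk
            count = int(round(end * rate)) - first    # number of frames to copy
            src.setpos(first)
            frames = src.readframes(count)
            path = f"{out_prefix}_{i:04d}.wav"
            with wave.open(path, "wb") as dst:
                dst.setparams(params)   # frame count in the header is fixed on close
                dst.writeframes(frames)
            paths.append(path)
    return paths
```

Reading the large file once and seeking per chunk keeps memory use low, which matters when the whole match is stored as a single wav file.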
Examples of data
• Yug-Ned Dutch
• Yug-Ned English
• Eng-Dld Dutch
• Eng-Dld English
Statistics
                  Dutch   English
#chunks            5146      5613
#speech chunks     3006      3725
#empty chunks      2140      1843
#words (types)     1954      2923
#words (tokens)   12079     24022
English matches have two commentators, Dutch only one. Overlapping segments have been disregarded.
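A few derived figures make the two data sets easier to compare. The sketch below only does arithmetic on the counts in the table; the dictionary layout and the function name `derived_stats` are illustrative choices, not part of the original work.

```python
# Counts copied from the statistics table above.
STATS = {
    "Dutch":   {"chunks": 5146, "speech": 3006, "empty": 2140,
                "types": 1954, "tokens": 12079},
    "English": {"chunks": 5613, "speech": 3725, "empty": 1843,
                "types": 2923, "tokens": 24022},
}


def derived_stats(counts):
    """Return simple ratios computed from one language's counts."""
    return {
        # average number of word tokens in a chunk that contains speech
        "tokens_per_speech_chunk": counts["tokens"] / counts["speech"],
        # vocabulary size relative to the amount of running text
        "type_token_ratio": counts["types"] / counts["tokens"],
        # share of chunks without speech
        "empty_fraction": counts["empty"] / counts["chunks"],
    }


if __name__ == "__main__":
    for language, counts in STATS.items():
        print(language, derived_stats(counts))
```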
Training
Dutch:
• Yug-Ned ¾ of CD (19 min speech)
• France Telecom Noise Reduction (FTNR)
English:
• Yug-Ned ¾ of CD (28 min speech)
• FTNR
For more information on the France Telecom Noise Reduction tool, see: B. Noé, J. Sienel, D. Jouvet, L. Mauuary, L. Boves, J. de Veth & F. de Wet, “Noise Reduction for Noise Robust Feature Extraction for Distributed Speech Recognition”, in Proc. of Eurospeech ’01.
Test
Dutch:
• Yug-Ned ¼ of CD
  – 626 chunks, 1577 words
  – lexicon and language model based on complete Yug-Ned match
English:
• Yug-Ned ¼ of CD
  – 636 chunks, 2641 words
  – lexicon and language model based on complete Yug-Ned match
SNR before and after FTNR tool
WER results for Yug-Ned before and after FTNR
[Bar chart. x-axis: training material acoustic models (NL-original, NL-FTNR, Eng-Original, Eng-FTNR); y-axis: WER (%), 40 to 60]
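For reference, word error rate (WER) is conventionally computed as the number of substitutions, deletions and insertions needed to turn the recognizer output into the reference transcription, divided by the number of reference words. The sketch below shows that standard computation; it is not the scoring tool used for these results, and the function names are illustrative.

```python
def word_errors(reference, hypothesis):
    """Return (edit_distance, reference_length) for two word strings."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1], len(ref)


def wer(pairs):
    """WER in percent over (reference, hypothesis) pairs, pooled over chunks."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return 100.0 * errors / words
```

Pooling errors and words over all chunks before dividing avoids giving very short chunks a disproportionate weight.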
Dutch – Polyphone
• Data consists of phonetically rich sentences
• Phone models were trained on:
  – Polyphone all speakers
  – Polyphone male speakers
  – Polyphone male speakers + MUMIS noise
• Polyphone as bootstrap for segmentation of MUMIS material
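Adding MUMIS noise to clean Polyphone speech can be illustrated with a simple additive mix at a chosen signal-to-noise ratio. The slides do not say how the noise was added or at which level, so the target SNR parameter, the looping of the noise signal and the function names below are all assumptions made for this sketch.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mix has the requested SNR in dB.

    The noise is repeated if it is shorter than the speech (an assumption).
    """
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_speech <= 0 or p_noise <= 0:
        raise ValueError("speech and noise must both have non-zero power")
    # Choose gain so that p_speech / (gain**2 * p_noise) == 10**(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * noise[i % len(noise)] for i, s in enumerate(speech)]
```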
Polyphone models (Dutch), Yug-Ned test set
[Bar chart. x-axis: training material acoustic models (Poly-all, Poly-male, Poly-male+noise, Poly-seg.MUMIS); y-axis: WER (%), 45 to 95]
Cross tests (Dutch & English)
Cross-tests:
• train on ¾ of Yug-Ned, test on ¼ of Eng-Dld
• train on ¾ of Eng-Dld, test on ¼ of Yug-Ned
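The within-match and cross-match conditions can be enumerated as in the sketch below. It assumes that each match is split in time order, with the first ¾ of the chunks used for training and the last ¼ for testing; the slides do not state how the split was made, and the function names are hypothetical.

```python
def split_match(chunks, train_fraction=0.75):
    """Split a match's chunks into (train, test), keeping the original order."""
    cut = int(len(chunks) * train_fraction)
    return chunks[:cut], chunks[cut:]


def cross_test_conditions(matches):
    """Enumerate every train/test pairing for a dict {match_name: chunks}.

    A condition is a cross-test when the training and test match differ.
    """
    parts = {name: split_match(chunks) for name, chunks in matches.items()}
    conditions = []
    for test_name, (_, test_part) in parts.items():
        for train_name, (train_part, _) in parts.items():
            conditions.append({
                "train": train_name,
                "test": test_name,
                "cross": train_name != test_name,
                "train_chunks": train_part,
                "test_chunks": test_part,
            })
    return conditions
```

With two matches this yields four conditions per language, matching the four bars in each of the MUMIS model charts.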
MUMIS models (Dutch)
[Bar chart. x-axis: training material acoustic models, grouped by test set (Yug-Ned test: Yug-Ned, Eng-Dld-cross; Eng-Dld test: Eng-Dld, Yug-Ned-cross); y-axis: WER (%), 45 to 70]
MUMIS models (English)
[Bar chart. x-axis: training material acoustic models, grouped by test set (Yug-Ned test: Yug-Ned, Eng-Dld-cross; Eng-Dld test: Eng-Dld, Yug-Ned-cross); y-axis: WER (%), 45 to 70]
MUMIS models (Dutch+English)
[Bar chart. x-axis: training material acoustic models, grouped by test set (Yug-Ned test: Yug-Ned, Eng-Dld-cross; Eng-Dld test: Eng-Dld, Yug-Ned-cross); y-axis: WER (%), 45 to 70; series: NL, ENG]
Function words vs content words
[Bar chart. x-axis: Yug-Ned and Eng-Dld for the English data, Yug-Ned and Eng-Dld for the Dutch data; y-axis: WER (%), 0 to 80; series by word type: function, content, names, all]
SNR vs. WER (1): Dutch data
[Plot. x-axis: SNR1 (dB), 0 to 30; y-axis: WER (%), 0 to 90; series: YugNed, YugNed_ftnr, EngDld]
SNR vs. WER (2): English data
[Plot. x-axis: SNR1 (dB), 0 to 40; y-axis: WER (%), 0 to 90; series: YugNed, YugNed_ftnr, EngDld]
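One common way to estimate an SNR in dB for mixed recordings is to take the noise power from a segment without speech and subtract it from the power of a segment with speech. The sketch below illustrates that idea only; the slides do not define SNR1, so this is not necessarily the measure plotted above, and using an empty chunk as the noise estimate is an assumption.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech_chunk, noise_chunk):
    """Estimate SNR in dB from a speech+noise chunk and a noise-only chunk."""
    p_noise = mean_power(noise_chunk)
    # Power of the speech alone, assuming speech and noise are uncorrelated.
    p_signal = mean_power(speech_chunk) - p_noise
    if p_noise <= 0 or p_signal <= 0:
        raise ValueError("cannot estimate SNR from these chunks")
    return 10 * math.log10(p_signal / p_noise)
```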
Discussion
• WERs are high
• Noise?
  – FTNR leads to lower SNR, but WERs do not improve substantially
• Not enough training data?
  – Polyphone for training/bootstrapping does not lead to lower WERs than training on MUMIS data
  – Noisifying Polyphone with MUMIS noise gives encouraging results
Discussion continued
• Function words comprise ± 50% of the data, and cause a great deal of the errors
• Names are recognized very well
• Function words not necessary for information extraction (?)
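The share of function words in a set of transcriptions can be measured with a simple word list, as sketched below. The word list here is a short hypothetical English example for illustration; the list actually used to classify words in these experiments is not given in the slides.

```python
# Hypothetical, deliberately short English function-word list.
FUNCTION_WORDS = frozenset({
    "the", "a", "an", "of", "to", "in", "on", "and",
    "is", "it", "for", "that", "with",
})


def function_word_share(transcripts, function_words=FUNCTION_WORDS):
    """Return the fraction of word tokens that are function words."""
    tokens = [w.lower() for line in transcripts for w in line.split()]
    if not tokens:
        return 0.0
    hits = sum(1 for w in tokens if w in function_words)
    return hits / len(tokens)
```

Counting tokens rather than types is what makes the share comparable to a "± 50% of the data" figure, since a small set of function words recurs very often.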
Future work
• Steps to noise robust speech recognition:
  – model/speaker adaptation
  – combinations of noisified Polyphone models and FTNR
• Other issues:
  – transcription of more data
    • English, Dutch and German
    • preference for specific games? radio? TV?
  – generic football specific language model
  – confidence measures?
Future work continued
Questions:
• What type of output from ASR is needed?
  – word-graph
  – n-best list
  – top of the list
  – word spotting? only content words?
• For research purposes: is it possible to obtain data that has not been mixed (noise + commentary)?