
ELIS-DSSP, Sint-Pietersnieuwstraat 41, B-9000 Gent

Recognition of foreign names spoken by native speakers

Frederik Stouten & Jean-Pierre Martens

Ghent University

Electronics and Information Systems (ELIS)

Interspeech 07 - August 30th 07

Overview

• Problem statement
• Methodology
  – computing phonological scores
  – foreignizable phonemes
• Experiments
  – baseline system
  – systems with methodology implementation
• Conclusions


Problem statement

• Automatic attendant or car navigation systems
  – lexicon may contain > 100K words
  – many of foreign origin
• A native speaker of Dutch can pronounce Andrew as
  – nativized: A n d r E w
  – intermediate: E n d r u w
  – foreignized: E n d r u


Problem situation

• Standard solutions
  – foreign g2p’s + mapping to native phonemes
  – include foreign phoneme acoustic models
• Our proposal
  – combine scores of standard acoustic models and a phonologically inspired back-off model
    • both models trained on native speech only
  – use foreign g2p’s without phoneme mapping
  – introduce foreignizable phonemes instead of traditional foreign-to-native phoneme mappings


Combining scores

• two-stream score per acoustic model state q
  – standard model: log pA(x | q)
  – phonological back-off model: log pB(x | q)
• control parameters
  – g1q, g2q = state dependent stream weights (different risk for foreignized pronunciation)
  – α, β = state independent scaling coefficients (to get same overall mean, variance)
  – equidistant samples on g1q + g2q = 1 (factor has no effect)

$$LL(x \mid q) = g_{1q} \log p_A(x \mid q) + g_{2q}\,[\alpha \log p_B(x \mid q) - \beta]$$
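The two-stream score can be sketched in a few lines of Python. This is only an illustration of the formula on this slide: the function name, the argument names and the numbers are assumptions made for the example, not values from the experiments.

```python
def two_stream_score(log_pa, log_pb, g1, alpha, beta):
    """Combine the standard acoustic score and the back-off score for one state q.

    log_pa: log pA(x | q) from the standard acoustic model
    log_pb: log pB(x | q) from the phonological back-off model
    g1:     stream weight g1q; g2q follows from the constraint g1q + g2q = 1
    alpha, beta: state independent scaling coefficients
    """
    if not 0.0 <= g1 <= 1.0:
        raise ValueError("g1 must lie in [0, 1]")
    g2 = 1.0 - g1
    return g1 * log_pa + g2 * (alpha * log_pb - beta)


# Hypothetical numbers, chosen only to illustrate the formula.
baseline = two_stream_score(log_pa=-42.0, log_pb=-3.0, g1=1.0, alpha=2.0, beta=1.0)
mixed = two_stream_score(log_pa=-42.0, log_pb=-3.0, g1=0.7, alpha=2.0, beta=1.0)
```

With g1 = 1 the back-off stream is switched off and the score reduces to the standard acoustic score, which corresponds to a system without back-off model.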


Combining scores

• Computation of log pB(x | q)
  – phonological feature space: binary features fi (i=1,…,25)
  – map each state to phonological space
    • select features of state on basis of forced alignment of speech with standard acoustic models
    • select fi with large enough mean of P(fi | x) / P(fi) on state
    • other strategy for foreignizable phonemes (see further)
  – compute posterior probabilities P(fi | x)
    • configuration of 4 neural networks
  – convert posterior probabilities to log-likelihood

$$\log p_B(x \mid q) = \log \frac{P_B(q \mid x)}{P_B(q)} + \log p_B(x)$$
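The feature selection step for one state could look like the sketch below. It assumes the frames aligned to the state are already available as lists of posteriors P(fi | x). The threshold of 1.5 is a hypothetical value, since the slide only says "large enough", and the function name is likewise an assumption.

```python
def select_positive_features(frame_posteriors, priors, threshold=1.5):
    """Return the indices i whose mean P(fi | x) / P(fi) over the frames
    aligned to one state exceeds the threshold.

    frame_posteriors: one list of posteriors per aligned frame
    priors:           prior P(fi) per feature
    threshold:        hypothetical cut-off for "large enough"
    """
    if not frame_posteriors:
        return []
    n_frames = len(frame_posteriors)
    selected = []
    for i, prior in enumerate(priors):
        mean_ratio = sum(frame[i] / prior for frame in frame_posteriors) / n_frames
        if mean_ratio > threshold:
            selected.append(i)
    return selected


# Three frames aligned to one state, three toy features (the slides use 25).
frames = [[0.9, 0.1, 0.5], [0.8, 0.2, 0.4], [0.7, 0.0, 0.6]]
priors = [0.4, 0.3, 0.5]
positive = select_positive_features(frames, priors)
```

In this toy example only the first feature is selected: its mean ratio is 2.0, while the other two features have mean ratios of about 0.33 and 1.0.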


Combining scores

• Come to final two-stream score
  – g2q less dependent on q than log PB(q | x) / PB(q)
  – g2q log pB(x) = discardable
  – computation of log PB(q | x) / PB(q)
    • Pq: positive features that are ‘on’ for state q
    • Nq: negative features absent or ‘off’ for q

$$LL(x \mid q) = g_{1q} \log p_A(x \mid q) + g_{2q}\left[\alpha \log \frac{P_B(q \mid x)}{P_B(q)} - \beta\right]$$


Combining scores

• Assuming independent PHFs we get

$$\log \frac{P_B(q \mid x)}{P_B(q)} = \underbrace{\sum_{f_i \in P_q} \log \frac{P(f_i \mid x)}{P(f_i)}}_{(1)} + \underbrace{\sum_{f_i \in N_q} \log \frac{1 - P(f_i \mid x)}{1 - P(f_i)}}_{(2)}$$

• Start with only positive features (term (1))
  – problem: unequal number for different q
  – solution: take average, i.e. wqp × (1), with wqp = 1 / card(Pq)
  – experiment showed this is better
• Add negative features (term (2))
  – supposed to represent same probability
  – experiment shows 75 % correlation between (1) and (2)
  – keeping (1) + (2) is slightly better than discarding (2)
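A minimal sketch of the back-off score built from terms (1) and (2) follows. The averaging of term (1) with wqp = 1 / card(Pq) is taken from the slide. Averaging term (2) over card(Nq) in the same way is an assumption of this sketch, as are the clamping constant and all names.

```python
import math

EPS = 1e-6  # clamp posteriors to avoid log(0); a detail of this sketch only


def _clamp(p):
    return min(max(p, EPS), 1.0 - EPS)


def backoff_log_ratio(posteriors, priors, positive, negative=()):
    """Approximate log PB(q | x) / PB(q) assuming independent PHFs.

    positive: indices of the features in Pq (term (1))
    negative: indices of the features in Nq (term (2)); may be empty
    """
    term1 = [math.log(_clamp(posteriors[i]) / priors[i]) for i in positive]
    term2 = [math.log((1.0 - _clamp(posteriors[i])) / (1.0 - priors[i]))
             for i in negative]
    # Term (1) averaged with wqp = 1 / card(Pq), as on the slide.
    score = sum(term1) / len(term1) if term1 else 0.0
    # Term (2) averaged over card(Nq): an assumption, not stated on the slide.
    if term2:
        score += sum(term2) / len(term2)
    return score


# Toy example: feature 0 is 'on' for the state, feature 1 is 'off'.
score_pos = backoff_log_ratio([0.8, 0.1], [0.4, 0.5], positive=[0])
score_both = backoff_log_ratio([0.8, 0.1], [0.4, 0.5], positive=[0], negative=[1])
```

Both terms are positive here because the posterior of the 'on' feature lies above its prior and the posterior of the 'off' feature lies below its prior.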


Introducing foreignizable phonemes

• Baseline pronunciation of foreign name
  – take foreign language g2p output
  – map foreign phonemes to best native equivalent
• Our pronunciation
  – if equivalent has different PHFs, keep info of original: foreignizable phoneme /NativePhon/_/ForeignPhon/
  – e.g. /rr/ → /r/_/rr/ (Dutch /r/ originating from English /rr/)
  – 6 such phonemes for English → Dutch
  – use positive PHFs of /ForeignPhon/ (knowledge based)



• Pronunciation variants
  – mix of standard and new approach

name          variant        transcription
Alan Presser  baseline       E l @ n _ p r E s @ r
              alternative 1  E l @ n _ p r_rr E s @ r
              alternative 2  E l @ n _ p r E s @ r_rr
              alternative 3  E l @ n _ p r_rr E s @ r_rr
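The variants in the table can be enumerated mechanically. The sketch below assumes that each phoneme of the baseline transcription is annotated with its foreign origin (or None when the native equivalent is kept as is). The function name and the annotation format are assumptions for illustration.

```python
from itertools import product


def pronunciation_variants(phonemes, origins):
    """Return every mix of native and foreignizable realizations.

    phonemes: baseline transcription as a list of native phoneme symbols
    origins:  per phoneme, the foreign phoneme it originates from, or None
    """
    options = []
    for native, foreign in zip(phonemes, origins):
        if foreign is None:
            options.append([native])
        else:
            # Foreignizable phoneme written as NativePhon_ForeignPhon.
            options.append([native, f"{native}_{foreign}"])
    return [" ".join(choice) for choice in product(*options)]


baseline = "E l @ n _ p r E s @ r".split()
origins = [None] * len(baseline)
origins[6] = "rr"   # first /r/ originates from English /rr/
origins[10] = "rr"  # final /r/ originates from English /rr/
variants = pronunciation_variants(baseline, origins)
```

For Alan Presser this yields four transcriptions: the baseline plus the three alternatives listed in the table.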


Experiments

• Recognition of English names
  – database from Nuance (N. Cremelie and L. ten Bosch)
  – 2050 English name utterances
  – 21 different names
  – 26 native speakers of Dutch
• Recognizer
  – Standard acoustic models: cross-word triphones, trained on Dutch read speech
  – PHF feature detector: neural network configuration, trained on Dutch read speech
  – Vocabulary: 21 English names + 1779 Dutch names
  – Lexicon: different transcriptions for each name (see next slide)


Baseline system

• No back-off model used
• Effects of different types of transcriptions measured

lexicon   WER (%)  CI95 (%)
DuAlone   30.3     28.4-32.3
DuMan     23.5     21.6-25.3
EngAlone  23.1     21.2-24.9
EngDu     18.2     16.5-19.9
EngMan    16.8     15.2-18.4
ManAlone  24.7     22.8-26.5

• Most important findings
  1. English much better than Dutch transcriptions (alone)
     – model foreign pronunciations
  2. Dutch transcriptions inevitable
     – model native pronunciations


Systems with back-off model

• system FOREIGN
  – consider one foreignizable phoneme at a time
  – same g1 on all its states: find optimal value under condition that g1 = 1 for all other phonemes
  – repeat process until all foreignizable phonemes treated
• system NATIVE
  – same g1 on all states
  – search for best g1
• system ALL
  – foreignizable phonemes: g1 taken from FOREIGN
  – other phonemes: same g1, taken from NATIVE
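One reading of the FOREIGN tuning procedure is a per-phoneme grid search, sketched below. The callable `evaluate_wer` stands for a full recognition run and is hypothetical, as are the toy WER function, the phoneme label `x_y` and the grid of candidate weights.

```python
def tune_foreign_weights(foreignizable, evaluate_wer, grid):
    """Tune g1 for one foreignizable phoneme at a time.

    evaluate_wer: hypothetical callable mapping {phoneme: g1} to a WER;
                  phonemes missing from the dict are scored with g1 = 1
    grid:         candidate g1 values (equidistant samples in [0, 1])
    """
    weights = {}
    for phoneme in foreignizable:
        # All other phonemes keep g1 = 1 while this one is optimized.
        weights[phoneme] = min(grid, key=lambda g1: evaluate_wer({phoneme: g1}))
    return weights


def toy_wer(weights):
    """Toy stand-in for a recognition experiment (not real data)."""
    optimum = {"r_rr": 0.3, "x_y": 0.4}  # hypothetical optima
    return 16.0 + sum((weights.get(p, 1.0) - g) ** 2 for p, g in optimum.items())


grid = [i / 10 for i in range(11)]
tuned = tune_foreign_weights(["r_rr", "x_y"], toy_wer, grid)
```

System ALL would then reuse the tuned values for the foreignizable phonemes and a single shared g1 for all other phonemes.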


Systems with back-off model

• Main results: relative improvement of 11% (system ALL vs. BASELINE)
• Other results
  – g1 < 0.5 for system FOREIGN
  – g1 > 0.5 for system NATIVE

system    g1       g2       WER (%)
BASELINE  1        0        18.2
FOREIGN   opt.     opt.     16.5
NATIVE    0.7      0.3      17.3
ALL       opt+0.7  opt+0.3  16.2
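The relative improvement quoted on this slide can be recomputed from the WER column. The short sketch below does this for all three systems. The function name is an assumption; the WER values are those from the table.

```python
def relative_improvement(baseline_wer, system_wer):
    """Relative WER reduction with respect to the baseline, in percent."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


baseline_wer = 18.2
results = {"FOREIGN": 16.5, "NATIVE": 17.3, "ALL": 16.2}
ri = {name: relative_improvement(baseline_wer, wer) for name, wer in results.items()}
```

This gives roughly 9.3% for FOREIGN, 4.9% for NATIVE and 11.0% for ALL, the last of which matches the 11% stated as main result.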


Latest work

• Seek confirmation of results on other data
• Autonomata database (STEVIN-project)
  – 60000 names, 5000 different names, 240 speakers
  – French + English + Dutch names
  – French + English + Dutch speakers
  – French + English + Dutch g2p outputs per name
  – large relative improvement (RI) by using foreign g2p’s on French and English
  – much larger RI with our methodology than here
  – paper submitted to ASRU-2007


Conclusions as of today

• large improvements on foreign name recognition by adding foreign g2p outputs (RI of around 40%)

• substantial extra improvements by adding new methodology (RI of up to 30%)