
A Comparison of Normalization and Training Approaches for ASR-Dependent Speaker Identification¹

Alex Park and Timothy J. Hazen

MIT Computer Science and Artificial Intelligence Laboratory
32 Vassar Street, Cambridge, MA 02139, USA

{malex, hazen}@sls.csail.mit.edu

Abstract

In this paper we discuss a speaker identification approach, called ASR-dependent speaker identification, that incorporates phonetic knowledge into the models for each speaker. This approach differs from traditional methods for performing text-independent speaker identification, such as global Gaussian mixture modeling, that typically ignore the phonetic content of the speech signal. We introduce a new score normalization approach, called phone adaptive normalization, which improves upon our previous speaker adaptive normalization technique. This paper also examines the use of automatically generated transcriptions during the training of our speaker models. Experiments show that speaker models trained using automatically generated transcriptions achieve the same performance as models trained using manually generated transcriptions.

1. Introduction

Traditional methods for performing text-independent speaker identification, such as the use of global Gaussian mixture speaker models (GMMs), typically ignore the phonetic content of the speech signal [5]. Although some researchers have sought to address this deficiency through phone-dependent modeling [1], even these approaches assume no phonetic knowledge of the test utterance. In a previous paper, we described a speaker identification approach that incorporates such phonetic knowledge into the models for each speaker [4]. Specifically, this approach rescores the best sentence hypothesis obtained from a speaker-independent speech recognizer using speaker-dependent versions of the context-dependent acoustic models used by the recognizer. This method is similar to the LVCSR-based speaker identification approach developed by Dragon Systems and described by Weber et al. in [8]. Because our approach relies on the output of an automatic speech recognition (ASR) system, we refer to it as ASR-dependent speaker identification. We have incorporated our speaker ID system into conversational systems at MIT to improve both security (to protect confidential user account information) and convenience (to avoid cumbersome login sub-dialogues) [3].

In this paper, we explore two issues associated with ASR-dependent speaker identification. First, we examine the issue of score combination and normalization. Typical speech recognizers have hundreds if not thousands of context-dependent (CD) acoustic models. However, the enrollment data available for any particular speaker may be limited, and therefore many of the CD acoustic models within that speaker's model set may have only a small number of training observations (or even none at all). In this case, appropriate mechanisms must be developed for scoring these poorly trained phonetic events when evaluating a new utterance. In this paper we examine a novel score combination and normalization approach called phone adaptive normalization, and compare it with the speaker adaptive normalization introduced in our previous work.

¹ This work was sponsored in part by an industrial consortium supporting the MIT Oxygen Alliance.

Second, because our approach relies on the imperfect output of an ASR system, it is important to consider the potential effects of speech recognition errors during both the enrollment and evaluation processes. To this end, we compare the use of accurate manual transcriptions versus inaccurate automatic transcriptions of the enrollment data when training the speaker identification models, and consider how these training approaches are affected by speech recognition errors during evaluation.

2. ASR-Dependent Speaker Identification

We will make use of the following notation when describing the ASR-dependent speaker identification approach and its corresponding normalization methods. Let X represent the set of feature vectors, {x_1, ..., x_N}, extracted from a particular spoken utterance. Let the reference speaker of an utterance X be S(X). Furthermore, assume that the aligned phonetic transcription of the utterance, Φ(X), provides a mapping between each feature vector x_k and its underlying phonetic unit φ(x_k).

In our ASR-dependent approach, each speaker S is represented by a set of models, p(x|S, φ), which mirror the CD acoustic models trained for speech recognition, p(x|φ), where φ ranges over the inventory of phonetic units used in the speech recognizer. The use of a set of context-dependent phonetic models for each speaker is markedly different from global GMM modelling approaches, where the goal is to represent a speaker with a single model, p(x|S). During evaluation, automatic speech recognition is performed on the utterance, producing an automatically generated phonetic transcription, Φ̂(X), which assigns each vector, x_k, to its most likely phonetic unit, φ̂(x_k). The phone assignments generated during speech recognition can then be used to calculate speaker-dependent, phone-dependent conditional probabilities, p(x|S, φ̂(x)). Ideally, these probabilities alone would act as suitable speaker scores for making a speaker identification decision. For example, the closed-set speaker identification result might be

\hat{S}(X) = \arg\max_{S} \; p(X \mid S, \hat{\Phi}(X))

In practice, however, enrollment data sets for each speaker are typically not large enough to accurately determine the parameters of p(x|S, φ(x)) for all φ(x).


To compensate for this problem, we can consider several normalization approaches which can synthesize an appropriate speaker score even for units which have sparse enrollment data.
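To make the decision rule concrete, here is a minimal sketch that scores an utterance against each enrolled speaker by summing frame-level log-likelihoods under that speaker's phone-dependent models. The callback sd_logprob and its signature are illustrative assumptions, not part of the system described in this paper.

    import math

    def closed_set_decision(frames, phone_labels, speakers, sd_logprob):
        """Return the speaker maximizing p(X | S, phi_hat(X)).

        frames       -- feature vectors x_1 ... x_N from the utterance
        phone_labels -- phi_hat(x_k) for each frame, from the SI recognizer
        speakers     -- enrolled speaker identifiers
        sd_logprob   -- hypothetical callback returning log p(x | S, phi)
        """
        best_speaker, best_score = None, -math.inf
        for s in speakers:
            # Frames are treated as conditionally independent given the
            # phonetic alignment, so the utterance score is a sum of logs.
            score = sum(sd_logprob(x, s, ph)
                        for x, ph in zip(frames, phone_labels))
            if score > best_score:
                best_speaker, best_score = s, score
        return best_speaker

The normalization approaches of the next section replace the raw log-likelihood with a normalized likelihood ratio that remains well behaved for sparsely trained models.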

3. Normalization Approaches

In this section, we present two normalization techniques which address the problem of constructing robust speaker scores when enrollment data for each speaker is unevenly distributed over the library of context-dependent phonetic events. The choice of normalization technique becomes especially important when a context-dependent phonetic event has few or no training tokens in the enrollment data.

3.1. Speaker Adaptive (SA) Normalization

We originally described a speaker adaptive normalization approach in [4]. This technique relies on interpolating speaker-dependent (SD) probabilities with speaker-independent (SI) probabilities on a per-unit basis. This approach learns the characteristics of a phone for a given speaker when sufficient enrollment data is available, but relies more on general speaker-independent models in instances of sparse enrollment data.

Mathematically, the speaker score can be written as

Y(X, S) = \frac{1}{|X|} \sum_{x \in X} \log \left[ \lambda_{S,\phi(x)} \frac{p(x \mid S, \phi(x))}{p(x \mid \phi(x))} + (1 - \lambda_{S,\phi(x)}) \frac{p(x \mid \phi(x))}{p(x \mid \phi(x))} \right]    (1)

where λ_{S,φ(x)} is the interpolation factor given by

\lambda_{S,\phi(x)} = \frac{n_{S,\phi(x)}}{n_{S,\phi(x)} + \tau}    (2)

In this equation, n_{S,φ(x)} refers to the number of times the CD phonetic event φ(x) was observed in the enrollment data for speaker S, and τ is an empirically determined tuning parameter that was the same across all speakers and phones.

By using the SI models in the denominator of the terms in Equation 1, the SI model set acts as the normalizing background model typically used in speaker verification approaches. The interpolation between SD and SI models allows our technique to capture detailed phonetic-level characteristics when a sufficient number of training tokens are available from a speaker, while falling back onto the SI model when the number of training tokens is sparse. In other words, the system backs off towards a neutral score of zero when a particular CD phonetic model has little or no enrollment data from a speaker. As an enrolled speaker contributes more enrollment data, the variance of the normalized scores increases and the scores become more reflective of how well (or poorly) a test utterance matches the characteristics of that speaker's models.
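As a minimal sketch of Equations 1 and 2, the following function computes the SA-normalized score for one speaker, assuming log-likelihood callbacks and a table of enrollment counts (sd_logprob, si_logprob, n_counts, and tau are illustrative names, not the paper's implementation).

    import math

    def sa_norm_score(frames, phone_labels, speaker,
                      sd_logprob, si_logprob, n_counts, tau):
        """Speaker adaptive normalization (Equations 1 and 2)."""
        total = 0.0
        for x, ph in zip(frames, phone_labels):
            n = n_counts.get((speaker, ph), 0)   # enrollment count n_{S,phi(x)}
            lam = n / (n + tau)                  # Equation 2
            # SD/SI likelihood ratio, interpolated with the SI/SI ratio (= 1).
            ratio = math.exp(sd_logprob(x, speaker, ph) - si_logprob(x, ph))
            # With no enrollment data (lam = 0) the bracketed term is 1, so
            # the log contribution backs off to the neutral score of zero.
            total += math.log(lam * ratio + (1.0 - lam))
        return total / len(frames)

For a well-trained phone (λ near 1) the score approaches the pure SD/SI log-likelihood ratio; for an unseen phone (λ = 0) the contribution is exactly log 1 = 0, the neutral backoff described above.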

3.2. Phone Adaptive (PA) Normalization

An alternative and equally valid technique for constructing speaker scores is to combine phone-dependent and phone-independent speaker model probabilities. In this scenario, the speaker-dependent, phone-dependent models can be interpolated with a speaker-dependent, phone-independent model (i.e., a global GMM) for that speaker. Analytically, the speaker score can be described as

Y(X, S) = \frac{1}{|X|} \sum_{x \in X} \log \left[ \lambda_{S,\phi(x)} \frac{p(x \mid S, \phi(x))}{p(x \mid \phi(x))} + (1 - \lambda_{S,\phi(x)}) \frac{p(x \mid S)}{p(x)} \right]    (3)

Here, λ_{S,φ(x)} has the same interpretation as before. The rationale behind this approach is to bias the speaker score towards the global speaker model when little phone-specific enrollment data is available. In the limiting case, this approach falls back to scoring with a global GMM model when the system encounters phonetic units that have not been observed in the speaker's enrollment data. This is intuitively more satisfying than the speaker adaptive approach, which backs off directly to the neutral score of zero when a phonetic event is unseen in the enrollment data.
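Under the same assumptions as the SA sketch above, only the backoff term changes for Equation 3: an unseen phone is scored by the speaker's global GMM normalized by a phone-independent background model. The gmm_logprob and bg_logprob callbacks are again illustrative names.

    import math

    def pa_norm_score(frames, phone_labels, speaker, sd_logprob, si_logprob,
                      gmm_logprob, bg_logprob, n_counts, tau):
        """Phone adaptive normalization (Equation 3)."""
        total = 0.0
        for x, ph in zip(frames, phone_labels):
            n = n_counts.get((speaker, ph), 0)
            lam = n / (n + tau)
            cd_ratio = math.exp(sd_logprob(x, speaker, ph) - si_logprob(x, ph))
            # Backoff term: global speaker GMM p(x|S) over background p(x),
            # so unseen phones are scored by the GMM rather than scoring zero.
            gmm_ratio = math.exp(gmm_logprob(x, speaker) - bg_logprob(x))
            total += math.log(lam * cd_ratio + (1.0 - lam) * gmm_ratio)
        return total / len(frames)

The only change from the SA sketch is the second interpolation term, so the two scorers can share most of their implementation.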

4. Training Approaches

One of the basic assumptions of the ASR-dependent speaker identification approach is that knowledge about the phonetic content of an utterance X can be gained by performing the assignment X → Φ̂(X) through automatic speech recognition. If the phonetic recognition error rate is low enough, then we can assume Φ̂(X) ≈ Φ(X). Under this assumption, we can use Φ(X), as derived from manual transcriptions, when training our speaker models. This is the approach we took in our previously reported experiments.

Alternatively, instead of using Φ(X), we could use automatically derived transcriptions Φ̂(X) for each enrollment utterance. In order to compare the effect of training from assignments derived from Φ(X) versus Φ̂(X), we trained speaker models using transcriptions automatically generated by the same speech recognizer used during testing. For the remainder of this paper, we will refer to the speaker models trained from manual transcriptions as manually trained (MT) models and the speaker models trained from automatic transcriptions as automatically trained (AT) models.
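The two training conditions differ only in where the phone assignment for each enrollment utterance comes from. The sketch below illustrates this; recognizer.align and recognizer.recognize are hypothetical stand-ins for forced alignment against the manual transcription and first-pass recognition, and the downstream model training step is only indicated in a comment.

    def collect_enrollment_frames(enroll_utts, recognizer, use_automatic):
        """Group a speaker's enrollment frames by phonetic unit.

        enroll_utts   -- list of (frames, manual_transcription) pairs
        use_automatic -- True for AT models, False for MT models
        """
        frames_by_phone = {}
        for frames, manual_text in enroll_utts:
            if use_automatic:
                labels = recognizer.recognize(frames)           # phi_hat(X)
            else:
                labels = recognizer.align(frames, manual_text)  # phi(X)
            for x, ph in zip(frames, labels):
                frames_by_phone.setdefault(ph, []).append(x)
        # Each phone's frames would then train the speaker-dependent CD
        # model p(x | S, phi); the counts also supply n_{S,phi} for Eq. 2.
        return frames_by_phone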

5. Experimental Conditions

5.1. Corpus Description

We conducted our experiments using a corpus of speaker-labeled data collected using the MIT MERCURY airline travel information system [7] and the MIT ORION task delegation system [6]. The 44 most frequent registered users of MERCURY and ORION were selected to represent the set of known users. Each of these users spoke a minimum of 48 utterances in the calls representing the training set for our experiments. As would be expected in real-world applications, the amount of enrollment data available for the known users varied greatly based on the frequency with which they used the systems. Of the 44 speakers, 15 had fewer than 100 utterances available for training, 19 had between 100 and 500 enrollment utterances, and 10 speakers had more than 500 utterances for enrollment. Within the 44 speakers, 21 were female and 23 were male.

For the test set, all calls made to the MERCURY system during a 10-month span were held out for our evaluation set. Only calls containing at least five utterances were included. The evaluation set was further broken down into two sub-sets: a set of 304 calls containing 3705 utterances from members of the set of 44 known users, and a set of 238 calls containing 2946 utterances from speakers outside of the known speaker set.


Each call had a variable number of utterances, with an average of 12 utterances per call (stdev = 6 utterances) and an average utterance duration of 2.3 sec (stdev = 1.4 sec).

Within the enrollment and evaluation data, the first two utterances of each call are typically restricted to be the caller's user name and date password, while the remainder of each call is usually comprised of unconstrained user requests mixed with user responses to system-initiated prompts. These utterances can be highly variable in their length and phonetic content, ranging from simple one-word utterances (such as "yes" or "no") to lengthy requests for flight information (e.g., "I need a flight from Boston to Miami on United next Tuesday afternoon.").

5.2. ASR Acoustic Modeling

For the ASR-dependent speaker identification approaches described in this paper, the segment-based SUMMIT system was used for speech recognition [2]. The recognizer performed word recognition using a vocabulary and language model derived specifically for whichever system was being used (i.e., either MERCURY or ORION). For these experiments, the recognizer's feature vectors (i.e., the acoustic observations X) were derived at acoustic landmarks hypothesized to be potential phonetic segment boundaries. These feature vectors were constructed by concatenating 14-dimension mean-normalized MFCC vectors averaged over eight different segments surrounding each landmark. Principal components analysis was then used to reduce the dimensionality of these feature vectors to 50 dimensions. The acoustic model set scored these landmarks using 1388 different context-dependent models. It should be noted that the general principles described in this paper are not specific to segment-based recognition and are applicable to frame-based speech recognition systems as well.
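The feature construction could be sketched roughly as follows. The paper does not specify the exact segmentation around each landmark, so the boundaries argument (nine frame indices delimiting eight non-empty segments per landmark) is an assumption made for illustration.

    import numpy as np

    def landmark_features(mfccs, boundaries, pca_matrix):
        """Build 50-dimensional landmark feature vectors.

        mfccs      -- (T, 14) array of mean-normalized MFCC frames
        boundaries -- per landmark, 9 frame indices delimiting the 8
                      surrounding segments (assumed segmentation)
        pca_matrix -- (112, 50) principal-components projection learned
                      on training data
        """
        feats = []
        for edges in boundaries:
            # Average the MFCCs over each of the 8 segments, then concatenate.
            segs = [mfccs[edges[i]:edges[i + 1]].mean(axis=0) for i in range(8)]
            feats.append(np.concatenate(segs))   # 8 x 14 = 112 dims
        return np.asarray(feats) @ pca_matrix    # project down to 50 dims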

6. Results

6.1. Evaluation Scenarios

For our experiments we have examined both the closed-set speaker identification and speaker verification problems. Because our data is collected via individual calls to our system, we can evaluate speaker identification at both the utterance level and the call level. In our case the utterance-level evaluation could be quite challenging because any single utterance could be quite short (such as the single-word utterance "no") or ill-matched to the caller's models (as might be the case if the caller uttered a new city name not observed in his/her enrollment data).

In many applications, it is acceptable to delay the decision on the speaker's identity for as long as possible in order to collect additional evaluation data. For example, when booking a flight, the system could continue to collect data while the caller is browsing for flights, and delay the decision on the speaker's identity until the caller requests a transaction requiring security, such as billing a reservation to a credit card. To simulate this style of speaker identification, we can evaluate the system using all available utterances from each call in the evaluation data.
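One way to simulate this delayed decision is to pool the per-utterance scores over a call; the sketch below averages them, which is an assumed pooling rule for illustration rather than the paper's stated combination method.

    def call_level_decision(call_frames, call_labels, speakers, score_fn):
        """Identify the speaker of a whole call instead of one utterance.

        call_frames -- list of per-utterance frame sequences
        call_labels -- matching per-utterance phone label sequences
        score_fn    -- per-utterance scorer, e.g. a partial application
                       of sa_norm_score or pa_norm_score sketched earlier
        """
        best_speaker, best_score = None, float("-inf")
        for s in speakers:
            # Average the normalized utterance scores across the call.
            scores = [score_fn(frames, labels, s)
                      for frames, labels in zip(call_frames, call_labels)]
            pooled = sum(scores) / len(scores)
            if pooled > best_score:
                best_speaker, best_score = s, pooled
        return best_speaker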

6.2. Comparison of normalization schemes

We performed several experiments. First, we compared the performance of the two normalization approaches on the task of closed-set speaker identification using the 3705 in-set utterances. The identification error rates are shown in Table 1.

Amount of            Speaker ID Error Rate (%)
Enrollment Data      SA Norm      PA Norm
Max 50 utts          26.4         22.5
Max 100 utts         18.4         15.9
All available         9.6          9.6

Table 1: Closed-set speaker identification error rates on individual utterances for different amounts of enrollment data per speaker, for speaker adaptive vs. phone adaptive normalization.

[Figure 1: DET curves showing false rejection probability versus false acceptance probability (both in %, ranging from 0.5 to 80) for speaker adaptive vs. phone adaptive normalization, with one curve each at the utterance level and the call level.]

We see that when using the full amount of enrollment data per speaker, both techniques perform equally well. On limited enrollment data per speaker, the phone adaptive normalization approach performs better. This is presumably because it retains the ability to discriminate between speakers even when there are many instances of sparsely trained phonetic units.

For the speaker verification task, we used the 2946 out-of-set utterances to perform same-gender imposter trials (i.e., each utterance was only used as an imposter trial against reference speakers of the same gender). The detection error trade-off (DET) curves of the two approaches are shown in Figure 1. For all four curves, the models were trained on all available data. From the region of low false acceptances through the equal error rate region of the DET curve, the two normalization techniques have very similar performance. However, in the permissive operating region with low false rejection rates, the phone adaptive approach has significantly lower false acceptance rates than the speaker adaptive approach. This observation is important for a conversational dialogue system where convenience to frequent users is a factor. For example, if we want to maximize convenience by ensuring that only 1% of true speakers are falsely rejected, then the speaker adaptive method will result in a 60.3% false acceptance rate, while the phone adaptive method will result in a 24.6% false acceptance rate.
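A DET curve such as the one in Figure 1 can be traced by sweeping a decision threshold over the scores of true-speaker and imposter trials. A minimal sketch follows, with the trial score lists assumed as inputs.

    def det_points(true_scores, imposter_scores, thresholds):
        """False rejection / false acceptance rates at each threshold.

        true_scores     -- scores of trials with a correct speaker claim
        imposter_scores -- scores of same-gender imposter trials
        A claim is accepted when its score meets or exceeds the threshold.
        """
        points = []
        for t in thresholds:
            fr = sum(s < t for s in true_scores) / len(true_scores)
            fa = sum(s >= t for s in imposter_scores) / len(imposter_scores)
            points.append((fa, fr))
        return points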

6.3. Comparison of training approaches

In our previously reported experiments, we used models derived from the manual transcriptions Φ(X) of the enrollment data.


[Figure 2: Plot of correct speaker discrimination versus utterance phone error rate using MT models. Each point represents a single utterance. The horizontal axis shows the phone recognition error rate on the utterance; the vertical axis indicates the difference between the scores of the reference speaker and the best competitor (negative values indicate identification error). A best-fit linear approximation of the data is superimposed.]

Under this condition, a high phonetic error rate within Φ̂(X) of the test data would result in a mismatch between the testing and training conditions. Therefore, we might expect higher phonetic error rates to be positively correlated with higher speaker identification error rates.

To demonstrate this, we plotted speaker discriminability versus phonetic error rate on a per-utterance basis in Figure 2. On this graph, the vertical axis indicates the difference between the scores of the reference speaker and the best competitor on a particular utterance, with negative values denoting an identification error. The horizontal axis is a measure of speech recognition accuracy at the phonetic level, and is given by

\epsilon(\hat{\Phi}(X), \Phi(X)) = \frac{\#\text{ phone errors in } \hat{\Phi}(X)}{|\Phi(X)|}
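The error count in the numerator comes from an alignment of the hypothesized phone string against the reference; a minimal Levenshtein-distance sketch (not the paper's scoring tool):

    def phone_error_rate(hyp, ref):
        """Phone error rate of hypothesis hyp against reference ref."""
        # Dynamic-programming edit distance; d[i][j] is the minimum number
        # of phone errors aligning hyp[:i] with ref[:j].
        d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
        for i in range(len(hyp) + 1):
            d[i][0] = i
        for j in range(len(ref) + 1):
            d[0][j] = j
        for i in range(1, len(hyp) + 1):
            for j in range(1, len(ref) + 1):
                sub = d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        # Normalize by the reference length |phi(X)|; rates above 100% are
        # possible when the hypothesis inserts many spurious phones.
        return d[len(hyp)][len(ref)] / len(ref)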

As expected, we observe a negative correlation between speaker discriminability and phone error rate, which confirms that higher phone error rates are correlated with poorer speaker discriminability (and hence higher speaker identification error).

It is reasonable to believe that the correlation between phonetic error rate and speaker error rate could be reduced by achieving a better match between training and testing conditions. In this case, using automatically derived transcriptions during training should provide a match to the testing conditions. However, when we repeated the above plot using the AT models, we observed a nearly identical distribution of points, resulting in a nearly identical correlation between the phone error rate and speaker score difference. In other words, the speaker identification performance of the AT models was almost identical to that of the MT models, regardless of the phonetic error rate.

The overall results of performing closed-set identification using models trained with each approach, shown in Table 2, confirm our finding that there is no significant gain or penalty from using the automatically transcribed enrollment data. Although there is no clear benefit in speaker identification accuracy when using the automatic transcriptions, their use is still beneficial because the speaker models can be trained in an unsupervised fashion (i.e., without requiring manual transcriptions of the enrollment data).

Enrollment           Speaker ID Error Rate (%)
Transcriptions       SA Norm      PA Norm
Manual               9.61         9.61
Automatic            9.77         9.36

Table 2: Comparison of closed-set speaker identification error rates on individual utterances for models trained from manually and automatically transcribed enrollment data.

7. Conclusion

In this paper, we addressed the issues of speaker score normalization and of using automatically generated transcriptions for training speaker models when performing ASR-dependent speaker identification.

We found that using a phone adaptive approach is beneficial for normalizing speaker scores compared to a speaker adaptive approach. Although both methods have similar speaker identification performance, the phone adaptive method generates scores that are more stable on speaker verification tasks, yielding fewer false acceptances of imposters at permissive operating points where low false rejection of known users is desirable.

In comparing the models trained from manually and automatically generated transcriptions, we found no significant differences in speaker discriminability between the two approaches. This finding indicates that we can take an unsupervised approach to training speaker models without adversely affecting our speaker identification results.

8. References

[1] U. Chaudhari, et al., "Multi-grained modeling with pattern specific maximum likelihood transformations for text-independent speaker recognition," IEEE Trans. Speech and Audio Processing, vol. 11, no. 2, pp. 100-108, Jan. 2003.

[2] J. Glass, "A probabilistic framework for segment-based speech recognition," Computer Speech and Language, vol. 17, pp. 137-152, 2003.

[3] T. J. Hazen, et al., "Integration of speaker recognition into conversational spoken dialogue systems," in Proc. EUROSPEECH, Geneva, Switzerland, Sept. 2003, pp. 1961-1964.

[4] A. Park and T. J. Hazen, "ASR dependent techniques for speaker identification," in Proc. ICSLP, Denver, Colorado, Sept. 2002, pp. 2521-2524.

[5] D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Communication, vol. 17, pp. 91-108, 1995.

[6] S. Seneff, C. Chuu and D. S. Cyphers, "Orion: From on-line interaction to off-line delegation," in Proc. ICSLP, Beijing, China, October 2000, pp. 767-770.

[7] S. Seneff and J. Polifroni, "Formal and natural language generation in the Mercury conversational system," in Satellite Dialogue Workshop of the ANLP-NAACL Meeting, Seattle, WA, April 2000.

[8] F. Weber, et al., "Speaker recognition on single- and multispeaker data," Digital Signal Processing, vol. 10, no. 1, pp. 75-92, January 2000.