Language modeling for speaker recognition

Dan Gillick

January 20, 2004

Outline

• Author identification

• Trying to beat Doddington’s “idiolect” modeling strategy (speaker recognition)

• My next project

Author ID (undergrad. thesis)

Problem:
– train models for each of k authors
– given some test text written by 1 of those authors, identify the correct author

Variations:
– different kinds of models
– different test sample sizes
– different k

Character n-gram models

What?
– 27 tokens: a-z, <space>
– some text generated from such a trigram model:

“you orthad gool of anythilly uncand or prafecaustiont and to hing that put ably”
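
For concreteness, here is a minimal Python sketch of such a character trigram model; the uniform fallback for unseen histories and the sampling details are my own assumptions, not the exact setup from the thesis.

import random
from collections import defaultdict

TOKENS = "abcdefghijklmnopqrstuvwxyz "  # 27 tokens: a-z plus <space>

def train_trigram(text):
    """Count character trigrams over the 27-token alphabet."""
    text = "".join(c for c in text.lower() if c in TOKENS)
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - 2):
        counts[text[i:i + 2]][text[i + 2]] += 1
    return counts

def generate(counts, length=80, seed="th"):
    """Sample text from the trigram model, like the example above."""
    out = seed
    for _ in range(length):
        nexts = counts.get(out[-2:])
        if not nexts:                      # unseen history: back off to uniform
            out += random.choice(TOKENS)
            continue
        chars, weights = zip(*nexts.items())
        out += random.choices(chars, weights=weights)[0]
    return out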

Character n-gram models

Why?
– very simple
– data sparseness less troublesome than with word n-grams
– supposed to be state-of-the-art, or at least close to it (Khmelev, D. and Tweedie, F.J., “Using Markov Chains for the Identification of Writers,” Literary and Linguistic Computing, 16(4): 299–307, 2001)

Character n-grams: Setup

• task: pick correct author from 10 possible authors

• training data: 3 novels for each author

• test data: text from a held-out novel

• jack-knifing: 4 novels for each of 20 authors

Character n-grams: Results

• task: picking 1 author from 10 possible authors

• training data size: 3 novels

Character n-gram models

Why does it work?
– captures some word choice information
– picks up word endings (-ing, -tion, -ly, etc.)
– not hurt much by data sparseness issues

Key-list models

Incentive:
– ought to be able to beat character n-grams
– develop a new modeling method more focused on what differentiates between authors (characters and words are both useful for topic recognition, but that doesn’t mean they are best for author recognition)

Key-list models

Idea:
– convert the text stream into a stream of only authorship-relevant symbols (I called these lists of symbols key-lists)
– each symbol is a regular expression to allow for broad definitions (/*tion/ captures any nounification)
– text not accounted for by the key-list is represented by <short>, <med>, or <long> markers
– build n-gram models from these new streams (a minimal sketch follows below)
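
A minimal sketch of this conversion, assuming a toy key-list and made-up length thresholds for the <short>/<med>/<long> markers (the thesis’s actual thresholds aren’t given here):

import re

# Toy key-list (regex -> symbol), in the spirit of the table on the next slide.
KEY_LIST = [
    (re.compile(r","), "<comma>"),
    (re.compile(r"\."), "<period>"),
    (re.compile(r"\b\w+ing\b"), "<ing>"),
    (re.compile(r"\b\w+ly\b"), "<adverb>"),
    (re.compile(r"\b(and|but|or|not|if|then|else)\b"), "<logical>"),
]

def gap_marker(n_chars):
    """Unmatched text becomes a length marker; thresholds are assumptions."""
    if n_chars == 0:
        return None
    return "<short>" if n_chars < 8 else "<med>" if n_chars < 25 else "<long>"

def to_key_stream(text):
    """Convert raw text into a stream of authorship-relevant symbols."""
    hits = sorted((m.start(), m.end(), sym)
                  for rx, sym in KEY_LIST for m in rx.finditer(text))
    stream, pos = [], 0
    for start, end, sym in hits:
        if start < pos:                    # skip overlapping matches
            continue
        gap = gap_marker(start - pos)
        if gap:
            stream.append(gap)
        stream.append(sym)
        pos = end
    gap = gap_marker(len(text) - pos)
    if gap:
        stream.append(gap)
    return stream                          # feed this stream to n-gram training

For example, to_key_stream("He ran quickly, and then stopped.") yields ['<short>', '<adverb>', '<comma>', '<short>', '<logical>', '<med>', '<period>'].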

Key-list models

Sample key-list:

Regular Expression                      Description
(\w)(,)(\s)                             comma
(\w)(\.)(\s)                            period
(\b)(of|for|to|around|after| … )(\b)    common prepositions
(\b)((was|were) \w*ed)(\b)              passive voice
(\b)(is|was|will|are|were|am)(\b)       is conjugations
(\b)(\w*ing)(\b)                        ends in -ing
(\b)(\w*ly)(\b)                         adverb
(\b)(and|but|or|not|if|then|else)(\b)   logical
(\b)(as)(\b)                            as
(\b)(would|should|could)(\b)            modal verbs

Sample trigram: <comma> <short> <period>

Key-list models: Results

• task: picking 1 author from 10 possible authors

• training data size: 3 novels

Key-list models: Results

Some other interesting results:
– key-lists with just punctuation (as well as <short>, <med>, <long>) performed almost as well as the best key-lists
– all key-lists were outperformed by the best n-letter model when test data size < 10,000 chars., but all key-list models eventually surpassed the n-letter models

Key-list models

Things I didn’t do:
– vary the amount of training data
– spend a long time trying different key-lists
– combine key-list results with each other or with the character results
– a lot of other stuff

The thesis is available on the web: http://www.dgillick.com/resource/thesis.pdf

Outline

• Author identification

• Trying to beat Doddington’s “idiolect” modeling strategy (speaker recognition)

• My next project

G. Doddington’s LM strategy

• create LMs with a limited vocabulary of the most commonly occurring 2000 bigrams

• to smooth out zeroes, boost each bigram prob. by 0.001

• score by calculating:

logprob(test|target) – logprob(test|bkg)

• logprobs are joint probabilities: logprob(AB) = logprob(A) + logprob(B|A)
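
A minimal sketch of this scoring rule, assuming the models are stored as dicts mapping each in-vocabulary bigram to its (conditional) probability; dropping out-of-vocabulary bigrams outright is my assumption, not a detail from the slide:

import math

BOOST = 0.001  # flat boost added to every bigram prob. to smooth out zeroes

def logprob(bigrams, model):
    """Joint log prob. of the test stream under a model, summing per-bigram
    log probs: logprob(AB) = logprob(A) + logprob(B|A)."""
    return sum(math.log(model.get(bg, 0.0) + BOOST) for bg in bigrams)

def detection_score(test_bigrams, vocab, target, background):
    """score = logprob(test | target) - logprob(test | bkg),
    restricted to the 2000-bigram vocabulary."""
    in_vocab = [bg for bg in test_bigrams if bg in vocab]
    return logprob(in_vocab, target) - logprob(in_vocab, background)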

G. Doddington’s LM: Setup

Switchboard 1 data:
– collected in the early ’90s from all over the US
– 2,400 (~5 min.) conversations among 543 speakers
– corpus divided into 6 splits and tested using jack-knifing through the splits
– manual transcripts provided by Mississippi State

Task:
– 8 conversation sides used as training data to build models for each target speaker
– 1 conversation side used as test data
– background model built from 3 splits of held-out data
– jack-knifing allowed for almost 10,000 trials

G. Doddington’s LM: Results

Notes:
– these results are my own attempt to replicate the original experiments
– SRI reported EER = 8.65% for this same experiment

Adapted bigram models

Incentive:
– adapting target models from a much larger background model should yield better estimates of the probabilities in the language models

Specifically:
– use the same 2000-bigram vocabulary
– target probabilities are a mixture of training probabilities and background probabilities
– mixture weight is 2:1 target data : background data (see the sketch below)
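
A sketch of the interpolation, reading the 2:1 weighting as a fixed 2/3 : 1/3 per-bigram mixture (whether the original used fixed weights or count-based MAP adaptation is not spelled out on the slide):

def adapt(target_probs, background_probs, w_target=2.0 / 3.0):
    """Adapted target model: per-bigram mixture of the (sparse) target
    estimates and the (well-estimated) background estimates."""
    return {bg: w_target * target_probs.get(bg, 0.0) + (1.0 - w_target) * p_bkg
            for bg, p_bkg in background_probs.items()}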

Adapted bigram models: Results

Notes:
– nearly identical performance
– combination of the 2 systems yields almost no improvement
– why isn’t the adapted version better?

Can anything improve on 8.68%?

Trigrams?
– use the same count threshold to make a list of the top 700 trigrams (“a lot of” and “I don’t know” were among the most common)

Character models?
– worked well for authorship…
– included all character combinations (no limited vocabulary)
– tried bigram and trigram models

Scores and combinations

Individual systems:
  adapt. word bigrams      EER = 8.89%
  adapt. word trigrams     EER = 11.88%
  adapt. char. bigrams     EER = 13.73%
  adapt. char. trigrams    EER = 17.92%

Combinations:
  adapted words (bigrams + trigrams)        EER = 8.46%
  adapted characters (bigrams + trigrams)   EER = 13.24%
  adapted words + adapted characters        EER = 7.89%

Baseline:
  GD bigrams               EER = 8.68%
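
These combined numbers suggest score-level fusion of per-trial detection scores; the equal-weight sum sketched below is an assumed stand-in, since the slide doesn’t say how the systems were actually combined:

def fuse(scores_a, scores_b, w=0.5):
    """Combine two systems' detection scores on the same trials.
    Equal weighting is an assumption, not a detail from the slide."""
    return [w * a + (1.0 - w) * b for a, b in zip(scores_a, scores_b)]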

Final Comparison

What about less training data?

1 conversation-side training:
– character models might provide more of an advantage with less data? not so:
  • GD EER = 22.5%
  • adapted character EER = 30%
  • adapted word EER = 20%
– maybe these character models pick up on the topic of that 1 conversation
– haven’t tried any other training data sizes

Outline

• Author identification

• Trying to beat GD’s result

• My next project

Key-lists for speaker recognition

• key-list n-grams picked up on phrasing (comma and period were valuable tokens)
  – automatic transcripts don’t have punctuation, but they do have pause and duration information

• use reg. exps. and duration info. to capture idiosyncratic speaker phrasing

• capture other speech information in key-lists? (energy, f0, etc.)

Acknowledgements

Thanks to:

Anand and Luciana at SRI for trying to help me replicate their results

Barbara for providing advice

Barry and Kofi for helping with computers and stuff

George