KL Realignment for Speaker Diarization with Multiple Feature Streams
Deepu Vijayasenan, Fabio Valente and Hervé Bourlard
Presented By: John Dines
IDIAP Research Institute
Martigny, CH
Interspeech 2009 – p. 1/16
Speaker Diarization
Speaker diarization determines "Who spoke when" in an audio stream

Agglomerative clustering
- Initialized with an over-determined number of speaker models
- At each step, the two most similar speaker models are merged according to a distance criterion

Conventional systems use an ergodic HMM
- Speaker model: an HMM state with a minimum duration
- Gaussian Mixture Models for state emission probabilities
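The agglomerative step can be sketched as a greedy merge loop. This is a toy illustration only (the function and the use of pooled frame matrices in place of trained speaker models are assumptions, not the actual HMM/GMM system):

```python
import numpy as np

def agglomerative_diarization(models, distance, stop_at):
    """Greedy agglomerative clustering: start from an over-determined
    number of speaker models and repeatedly merge the closest pair
    under the given distance criterion."""
    clusters = list(models)
    while len(clusters) > stop_at:
        # Find the most similar pair of clusters.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: distance(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = np.vstack([clusters[i], clusters[j]])  # pool the frames
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

In the real system the stopping point is decided by a criterion on the models rather than a fixed cluster count; `stop_at` just keeps the sketch short.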
Multistream Diarization
Separate models built for individual feature streams

Linear combination of individual log likelihoods

Individual features might possess diverse statistics: different likelihood ranges, dimensionality, etc.
[Figure: histograms of negative log likelihood for the MFCC and TDOA streams on meetings CMU_20050912 and CMU_20050914; the two streams occupy very different likelihood ranges]
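The scale mismatch can be reproduced with a toy computation: log likelihoods of Gaussian models over feature vectors of different dimensionality (19-dimensional standing in for MFCC, 2-dimensional for TDOA; illustrative standard-normal models, not the system's) differ by roughly an order of magnitude, so a single linear weight behaves inconsistently:

```python
import numpy as np

def gaussian_loglik(x):
    """log N(x; 0, I) for a standard normal of x's dimensionality."""
    d = x.shape[0]
    return -0.5 * (d * np.log(2 * np.pi) + float(np.sum(x ** 2)))

rng = np.random.default_rng(0)
ll_mfcc = gaussian_loglik(rng.standard_normal(19))  # high-dimensional stream
ll_tdoa = gaussian_loglik(rng.standard_normal(2))   # low-dimensional stream
# The two streams live on very different log-likelihood scales, so a
# fixed combination weight cannot balance them across meetings.
combined = 0.9 * ll_mfcc + 0.1 * ll_tdoa
```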
Motivation
The problem of features with diverse statistics is addressed in an ad-hoc manner

e.g., in [Pardo'06] the number of Gaussian components is initialized to one for TDOA and five for MFCC features

We had proposed a non-parametric approach based on the Information Bottleneck principle
- clustering is based on posteriors of a background GMM model

How to use posterior features for better multistream speaker diarization?
Overview
IB Principle
Speaker Diarization using IB
Feature combination
KL Realignment
IB Principle
Let X be a set of elements to cluster into a set of clusters C, and Y be a set of variables of interest

Assume the conditional distribution p(y|x) is available ∀x ∈ X, ∀y ∈ Y

The IB principle states that the cluster representation C should preserve as much information as possible about Y while minimizing the distortion between C and X

The objective is to maximize I(Y,C) − βI(X,C)
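As a concrete check of the objective, both mutual information terms can be evaluated from probability tables. The sketch below (function names and the hard-assignment representation are assumptions, not the paper's implementation) computes I(Y,C) − βI(X,C) for a hard clustering c(x):

```python
import numpy as np

def mutual_info(p_joint):
    """I(A;B) in nats from a joint probability table p(a, b)."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    m = p_joint > 0
    return float(np.sum(p_joint[m] * np.log(p_joint[m] / (pa * pb)[m])))

def ib_objective(p_x, p_y_given_x, assign, beta):
    """Evaluate I(Y,C) - beta * I(X,C) for a hard assignment c(x)."""
    n_c = int(assign.max()) + 1
    p_xc = np.zeros((len(p_x), n_c))
    p_xc[np.arange(len(p_x)), assign] = p_x   # joint p(x, c)
    p_yc = p_y_given_x.T @ p_xc               # joint p(y, c)
    return mutual_info(p_yc) - beta * mutual_info(p_xc)
```

For example, with four elements, deterministic p(y|x), and two clusters grouping the identically-distributed elements, I(Y,C) = I(X,C) = log 2, so the objective vanishes at β = 1.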
IB Optimization
The IB criterion is optimized by an agglomerative approach
- Initialized with |X| clusters
- Iteratively merge the two clusters that result in the minimum loss in the objective function
- A stopping criterion based on Normalized Mutual Information determines the number of clusters (threshold on I(Y,C)/I(Y,X))
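A minimal agglomerative IB loop under these rules might look as follows. It uses the standard aIB result that the loss in I(Y,C) from merging two clusters is a weighted Jensen-Shannon divergence, and stops on the normalized ratio I(Y,C)/I(Y,X); all names are illustrative, not the authors' toolkit:

```python
import numpy as np

def _kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def merge_cost(w1, p1, w2, p2):
    """Loss in I(Y;C) when merging two clusters: weighted JS divergence."""
    pi1, pi2 = w1 / (w1 + w2), w2 / (w1 + w2)
    p_bar = pi1 * p1 + pi2 * p2
    return (w1 + w2) * (pi1 * _kl(p1, p_bar) + pi2 * _kl(p2, p_bar))

def aib_cluster(p_x, p_y_given_x, nmi_threshold):
    """Agglomerative IB sketch: merge the minimum-loss pair until
    I(Y;C)/I(Y;X) would fall below the threshold."""
    # each cluster: (member indices, weight p(c), relevance distribution p(y|c))
    clusters = [(np.array([i]), float(p_x[i]), p_y_given_x[i].copy())
                for i in range(len(p_x))]
    p_xy = p_y_given_x * p_x[:, None]
    p_y = p_xy.sum(axis=0)
    m = p_xy > 0
    i_yx = float(np.sum(p_xy[m] * np.log(p_xy[m] / (p_x[:, None] * p_y[None, :])[m])))
    i_yc = i_yx  # with one cluster per element, I(Y;C) = I(Y;X)
    while len(clusters) > 1:
        pairs = [(a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda ab: merge_cost(clusters[ab[0]][1], clusters[ab[0]][2],
                                                    clusters[ab[1]][1], clusters[ab[1]][2]))
        cost = merge_cost(clusters[a][1], clusters[a][2], clusters[b][1], clusters[b][2])
        if (i_yc - cost) / i_yx < nmi_threshold:
            break  # merging further would lose too much information about Y
        w = clusters[a][1] + clusters[b][1]
        p_new = (clusters[a][1] * clusters[a][2] + clusters[b][1] * clusters[b][2]) / w
        members = np.concatenate([clusters[a][0], clusters[b][0]])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append((members, w, p_new))
        i_yc -= cost
    return clusters
```

The quadratic pair search is the naive version; practical implementations cache merge costs between iterations.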
IB based Speaker Diarization
Components of a background GMM used as relevance variables

Realignment used to smooth speaker boundaries of the IB system output
[Diagram: audio features X, with posteriors p(y|x) over background GMM components Y, feed aIB clustering followed by realignment to produce the diarization output]
Feature Combination
Individual posterior streams are combined to get a single posterior stream, i.e.,

    p(y|x) = Σ_i p(y|x, M_Fi) P_Fi

where
- M_Fi – background GMM for feature stream F_i
- P_Fi – prior probability of the stream F_i

Uses only posterior features – problems in combining likelihoods are eliminated
[Diagram: two feature streams X, each with posteriors p(y|x) over its own background GMM components Y, are combined into a single posterior stream before aIB clustering and realignment produce the diarization output]
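Assuming each stream's posteriors are expressed over a common set of relevance variables Y, the combination reduces to a prior-weighted average of posterior matrices (a sketch with illustrative names):

```python
import numpy as np

def combine_posterior_streams(posteriors, priors):
    """p(y|x) = sum_i P_Fi * p(y|x, M_Fi) for per-stream posterior
    matrices of shape (frames, |Y|) and stream priors summing to one."""
    priors = np.asarray(priors, dtype=float)
    combined = sum(w * p for w, p in zip(priors, posteriors))
    return combined
```

Because each row of each stream is a probability distribution and the priors sum to one, every combined row is again a valid distribution; no likelihood-scale normalization is needed, which is the point of working in posterior space.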
Realignment
Conventionally, Viterbi realignment based on an HMM/GMM improves speaker boundaries

Incorporates a minimum duration constraint on speaker turns
[Diagram: ergodic HMM with one state per speaker (Spkr 1, Spkr 2, …, Spkr N), each with a minimum duration]
Linear combination of log likelihoods for multiple features – might not scale if features have diverse statistics
KL based Realignment
Maximization of the mutual information in IB is equivalent to minimization of:

    arg min_c Σ_t KL( p(Y|x_t) || p(Y|c_t) )

- p(Y|x_t) – posterior distribution of relevance variables in a frame
- p(Y|c_t) – posterior distribution of relevance variables of a cluster

In order to incorporate the minimum duration constraint, we extend this objective function as:

    arg min_c Σ_t [ KL( p(Y|x_t) || p(Y|c_t) ) − log( a_{c_t c_{t+1}} ) ]

- a_{c_i c_j} – transition probability from cluster c_i to c_j
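The extended objective decodes like a Viterbi problem: per-frame KL divergences act as local costs and −log a_{c_t c_{t+1}} as transition costs. A sketch follows (illustrative names; the minimum duration constraint itself, normally enforced by state replication, is omitted for brevity and only the transition penalty appears):

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL(p || q) in nats, with clipping to avoid log(0)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def kl_realign(frame_post, cluster_post, log_trans):
    """Viterbi decoding of arg min_c sum_t [KL(p(Y|x_t)||p(Y|c_t)) - log a]."""
    T, K = len(frame_post), len(cluster_post)
    local = np.array([[kl_div(frame_post[t], cluster_post[k]) for k in range(K)]
                      for t in range(T)])
    cost = local[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cost of reaching cluster k at time t via cluster j at time t-1
        step = cost[:, None] - log_trans + local[t][None, :]
        back[t] = step.argmin(axis=0)
        cost = step.min(axis=0)
    path = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

With self-transition probability above the cross-transition probability, short spurious cluster switches are penalized, which is exactly the boundary-smoothing effect realignment is meant to provide.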
KL based Realignment II
Can be solved using an EM algorithm, and thus realignment can be performed based on posterior features

The re-estimation formula becomes:

    p(y|c_i) = (1/p(c_i)) Σ_{x_t ∈ c_i} p(y|x_t) p(x_t)

Linear combination of the posteriors can be used in the case of multistream diarization

No additional computational effort with additional features
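The re-estimation formula is a weighted average of frame posteriors over each cluster's frames; a direct transcription (illustrative names and a hard frame-to-cluster assignment assumed):

```python
import numpy as np

def reestimate_cluster_posteriors(frame_post, p_x, assign, n_clusters):
    """p(y|c_i) = (1/p(c_i)) * sum over x_t in c_i of p(y|x_t) p(x_t)."""
    p_y_c = np.zeros((n_clusters, frame_post.shape[1]))
    p_c = np.zeros(n_clusters)
    for t, c in enumerate(assign):
        p_y_c[c] += frame_post[t] * p_x[t]   # accumulate p(y|x_t) p(x_t)
        p_c[c] += p_x[t]                     # accumulate p(c_i)
    return p_y_c / p_c[:, None]
```

Because only posterior vectors are averaged, adding a feature stream changes nothing here beyond the combination step that produces `frame_post`, which is why extra features cost no additional effort.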
Evaluation
Evaluation performed on the NIST RT06 evaluation data for the Meeting Diarization task

Since the same speech/non-speech reference segmentation is used, speaker error is used as the evaluation measure

Explored the combination of MFCC and TDOA features

Individual feature weights are empirically determined from a development dataset
- estimated weights are (P_MFCC, P_TDOA) = (0.7, 0.3), as compared to (0.9, 0.1) in likelihood-based combination
Results on RT06 Eval data
Speaker error (%):

    Feature      w/o realign   HMM/GMM   KL based
    MFCC            19.3         15.7      15.7
    TDOA            24.4         25.5      23.9
    MFCC+TDOA       11.6         10.7       9.9
Both realignment systems perform equally well with MFCC

KL realignment performs better with TDOA features (variable feature dimension across meetings) and with the feature combination
Conclusions
Proposed a KL divergence based realignment scheme that operates only on a set of posterior features

The system provides the same performance as conventional HMM/GMM realignment when tested on a single feature stream (MFCC)

The KL based realignment system performs better (9.9%) than conventional HMM/GMM realignment (10.7%) when tested on multiple feature streams (MFCC+TDOA)

The system is currently being extended to more feature streams, and initial results show improvements in performance