Maximum Likelihood Estimation for Multivariate Mixture Observations of Markov Chains (Levinson, 1986)


IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-32, NO. 2, MARCH 1986

INTRODUCTION

In a recently published paper, Liporace [6] derived a method for estimating the parameters of a broad class of elliptically symmetric probabilistic functions of a Markov chain. The corollary to that work presented here was motivated by the desire to use this general technique to model the speech signal, for which it is known [4], [8] that, unfortunately, certain of its most useful parameterizations do not possess the prescribed symmetry. Since any continuous probability density function can be approximated arbitrarily closely by a normal mixture [9], it is reasonable to use such constructs to avoid the restrictions imposed by the requirement of elliptical symmetry. In this correspondence we adapt the method and proof of [6] to two types of mixture densities.

NOMENCLATURE

Throughout this presentation we shall, where possible, adopt the notation used in [6]. Consider an unobservable n-state Markov chain with state transition matrix A = [a_ij], i, j = 1, ..., n. Associated with each state j of the hidden Markov chain is a probability density function b_j(x) of the observed d-dimensional random vector x. Here we shall consider densities of the form

    b_j(x) = \sum_{k=1}^{m} c_{jk}\, \mathcal{N}(x, \mu_{jk}, U_{jk}),    (1)

where m is known; c_jk ≥ 0 for 1 ≤ j ≤ n, 1 ≤ k ≤ m; Σ_{k=1}^{m} c_jk = 1 for 1 ≤ j ≤ n; and 𝒩(x, μ, U) denotes the d-dimensional normal density function of mean vector μ and covariance matrix U.
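As a concrete reading of (1), a state-conditional density of this kind can be evaluated as in the short sketch below; the function name and the use of scipy are our own choices for illustration, not part of the paper.

```python
from scipy.stats import multivariate_normal

def mixture_density(x, c_j, mus_j, Us_j):
    """Evaluate b_j(x) = sum_k c_jk * N(x, mu_jk, U_jk), as in (1).

    c_j   : (m,) mixture weights, nonnegative and summing to one
    mus_j : (m, d) mean vectors
    Us_j  : (m, d, d) covariance matrices (symmetric positive definite)
    """
    return sum(c * multivariate_normal.pdf(x, mean=mu, cov=U)
               for c, mu, U in zip(c_j, mus_j, Us_j))
```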

It is convenient then to think of our hidden Markov chains as being defined over a parameter manifold Λ = {𝒜 × 𝒞^m × ℛ^d × 𝒰^d}, where 𝒜 is the set of all n × n row-wise stochastic matrices; 𝒞^m is the set of all n × m row-wise stochastic matrices; ℛ^d is the usual d-dimensional Euclidean space; and 𝒰^d is the set of all d × d real symmetric positive definite matrices. Then for a given sequence of observations, O = O_1, O_2, ..., O_T, of the vector x and a particular choice of parameter values λ ∈ Λ, we can efficiently evaluate the likelihood function, L_λ(O), of the hidden Markov chain by the forward-backward method of Baum [1].

The forward and backward partial likelihoods, α_t(i) and β_t(i), are computed recursively from

    \alpha_{t+1}(j) = \left[ \sum_{i=1}^{n} \alpha_t(i)\, a_{ij} \right] b_j(O_{t+1})    (2)

and

    \beta_t(i) = \sum_{j=1}^{n} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j),    (3)

respectively. The recursion is initialized by setting α_0(1) = 1, α_0(j) = 0 for 2 ≤ j ≤ n, and β_T(i) = 1 for 1 ≤ i ≤ n, whereupon we may write

    L_\lambda(O) = \sum_{i=1}^{n} \alpha_t(i)\, \beta_t(i)    (4)

for any t between 1 and T - 1.
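Read literally, (2)-(4) with the stated initialization give the unscaled forward-backward sketch below. The precomputed matrix B of observation densities and the function name are our own conventions, and the unscaled recursions are practical only for short sequences (see the scaling remark at the end of the correspondence).

```python
import numpy as np

def forward_backward(A, B):
    """Unscaled forward-backward recursions (2)-(4) (illustrative sketch).

    A : (n, n) state transition matrix a_ij
    B : (T, n) observation likelihoods, B[t-1, j-1] = b_j(O_t)

    Returns alpha, beta of shape (T+1, n), indexed by t = 0..T,
    and the likelihood L_lambda(O).
    """
    T, n = B.shape
    alpha = np.zeros((T + 1, n))
    beta = np.ones((T + 1, n))
    alpha[0, 0] = 1.0                      # alpha_0(1) = 1, alpha_0(j) = 0 otherwise
    for t in range(T):                     # eq. (2)
        alpha[t + 1] = (alpha[t] @ A) * B[t]
    for t in range(T - 1, 0, -1):          # eq. (3), with beta_T(i) = 1
        beta[t] = A @ (B[t] * beta[t + 1])
    L = float(alpha[1] @ beta[1])          # eq. (4), evaluated at t = 1
    return alpha, beta, L
```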

THE ESTIMATION ALGORITHM

The parameter estimation problem is then one of maximizing L_λ(O) with respect to λ for a given O. One way to maximize L_λ is to use conventional methods of constrained optimization. Liporace, on the other hand, advocates a reestimation technique analogous to that of Baum et al. [1], [2]. It is essentially a mapping 𝒯: Λ → Λ with the property that

    L_{\mathcal{T}(\lambda)}(O) \geq L_\lambda(O),    (5)

with equality if and only if ∇_λ L_λ(O) = 0. Thus a recursive application of 𝒯 to some value of λ converges to a local maximum (or possibly an inflection point) of the likelihood function. Liporace's result relaxed the original requirement of Baum et al. [2] that b_j(x) be strictly log concave to the requirement that it be strictly log concave and/or elliptically symmetric. We will further extend the class of admissible pdf's to mixtures and products of mixtures of strictly log concave and/or elliptically symmetric densities.

For the present problem we will show that a suitable mapping 𝒯 is given by the following equations:

    \bar{a}_{ij} = \mathcal{T}(a_{ij}) = \frac{\sum_{t=1}^{T-1} \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i)\, \beta_t(i)}    (6)

    \bar{c}_{jk} = \mathcal{T}(c_{jk}) = \frac{\sum_{t=1}^{T} \rho_t(j, k)\, \beta_t(j)}{\sum_{t=1}^{T} \alpha_t(j)\, \beta_t(j)}    (7)

    \bar{\mu}_{jk} = \mathcal{T}(\mu_{jk}) = \frac{\sum_{t=1}^{T} \rho_t(j, k)\, \beta_t(j)\, O_t}{\sum_{t=1}^{T} \rho_t(j, k)\, \beta_t(j)}    (8)

    \bar{U}_{jk} = \mathcal{T}(U_{jk}) = \frac{\sum_{t=1}^{T} \rho_t(j, k)\, \beta_t(j)\, (O_t - \mu_{jk})(O_t - \mu_{jk})'}{\sum_{t=1}^{T} \rho_t(j, k)\, \beta_t(j)}    (9)

for 1 ≤ i, j ≤ n, 1 ≤ k ≤ m, and 1 ≤ r, s ≤ d. In (7)-(9),

    \rho_t(j, k) = \begin{cases} a_{1j}\, c_{jk}\, \mathcal{N}(O_1, \mu_{jk}, U_{jk}), & t = 1, \\ \left[ \sum_{i=1}^{n} \alpha_{t-1}(i)\, a_{ij} \right] c_{jk}\, \mathcal{N}(O_t, \mu_{jk}, U_{jk}), & 1 < t \leq T. \end{cases}    (10)

(For fixed k, ρ_t(j, k) is formally identical to ρ_t(j) as defined by Liporace.)
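To make one pass of the reestimation concrete, the sketch below implements (6)-(10) as reconstructed above, reusing the forward_backward routine and the B-matrix convention from the earlier sketch. All names, array layouts, and the unscaled arithmetic are our own choices rather than the paper's, and they are practical only for short observation sequences.

```python
import numpy as np
from scipy.stats import multivariate_normal

def reestimate(A, c, mus, Us, O):
    """One application of the mapping T of (6)-(10) (illustrative sketch).

    A : (n, n) transitions, c : (n, m) mixture weights,
    mus : (n, m, d) means, Us : (n, m, d, d) covariances, O : (T, d) observations.
    """
    T, d = O.shape
    n, m = c.shape
    # Component densities N(O_t, mu_jk, U_jk) and the state densities b_j(O_t) of (1).
    comp = np.array([[multivariate_normal.pdf(O, mean=mus[j, k], cov=Us[j, k])
                      for k in range(m)] for j in range(n)])        # (n, m, T)
    w = c[None, :, :] * comp.transpose(2, 0, 1)                     # w[t-1, j, k] = c_jk N(O_t, .)
    B = w.sum(axis=2)                                               # B[t-1, j] = b_j(O_t)
    alpha, beta, _ = forward_backward(A, B)                         # from the earlier sketch
    rho = (alpha[:T] @ A)[:, :, None] * w                           # eq. (10)
    rb = rho * beta[1:, :, None]                                    # rho_t(j, k) * beta_t(j)
    gamma = alpha[1:] * beta[1:]                                    # alpha_t(j) * beta_t(j)
    # eq. (6)
    A_new = (np.einsum('ti,ij,tj,tj->ij', alpha[1:T], A, B[1:], beta[2:])
             / gamma[:T - 1].sum(axis=0)[:, None])
    c_new = rb.sum(axis=0) / gamma.sum(axis=0)[:, None]             # eq. (7)
    mu_new = np.einsum('tjk,td->jkd', rb, O) / rb.sum(axis=0)[..., None]   # eq. (8)
    diff = O[:, None, None, :] - mus[None, :, :, :]                 # O_t - mu_jk
    U_new = (np.einsum('tjk,tjkd,tjke->jkde', rb, diff, diff)
             / rb.sum(axis=0)[..., None, None])                     # eq. (9)
    return A_new, c_new, mu_new, U_new
```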

Proof of the Formulas: Equations (6) and (7) for the reestimation of a_ij and c_jk follow directly from a theorem of Baum and Sell [3], because the likelihood function L_λ(O) given in (4) is a polynomial with nonnegative coefficients in the variables a_ij, c_jk, 1 ≤ i, j ≤ n, 1 ≤ k ≤ m.

To prove (8) and (9), our strategy, following Liporace, is to define an appropriate auxiliary function Q(λ, λ̄). This function will have the property that Q(λ, λ̄) ≥ Q(λ, λ) implies L_λ̄(O) ≥ L_λ(O). Further, as a function of λ̄ for any fixed λ, Q(λ, λ̄) will have a unique global maximum given by (6)-(9).

As a first step toward deriving such a function, we express the likelihood function as a sum over the set, 𝒮, of all state sequences S:

    L_\lambda(O) = \sum_{S \in \mathcal{S}} L_\lambda(O, S).    (11)


Let us partition the likelihood function further by choosing a particular sequence, K = (k_1, k_2, ..., k_T), of mixture densities. As in the case of state sequences, we denote the set of all mixture sequences as 𝒦 = {1, 2, ..., m}^T. Thus for some particular K ∈ 𝒦 we can write the joint likelihood of O, S, and K as

    L_\lambda(O, S, K) = \prod_{t=1}^{T} a_{s_{t-1} s_t}\, c_{s_t k_t}\, \mathcal{N}(O_t, \mu_{s_t k_t}, U_{s_t k_t}).    (12)

We have now succeeded in partitioning the likelihood function as

    L_\lambda(O) = \sum_{S \in \mathcal{S}} \sum_{K \in \mathcal{K}} L_\lambda(O, S, K).    (13)
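For very small n, m, and T, the decomposition (12)-(13) can be checked numerically against the forward-backward value (4). The brute-force enumeration below is our own illustrative sketch (exponential in T, so suitable only for sanity checks), assuming the chain starts in state 1 as in the initialization above.

```python
import itertools
from scipy.stats import multivariate_normal

def brute_force_likelihood(A, c, mus, Us, O):
    """Sum L_lambda(O, S, K) of (12) over all state and mixture sequences, cf. (13)."""
    T, _ = O.shape
    n, m = c.shape
    total = 0.0
    for S in itertools.product(range(n), repeat=T):          # state sequences
        for K in itertools.product(range(m), repeat=T):      # mixture sequences
            prob = 1.0
            prev = 0                                          # chain assumed to start in state 1
            for t in range(T):
                j, k = S[t], K[t]
                prob *= A[prev, j] * c[j, k] * multivariate_normal.pdf(
                    O[t], mean=mus[j, k], cov=Us[j, k])
                prev = j
            total += prob
    return total
```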

In view of the similarity of the representation (13) to that of L_λ in [6], we now define the auxiliary function

    Q(\lambda, \bar\lambda) = \sum_{S} \sum_{K} L_\lambda(O, S, K) \log L_{\bar\lambda}(O, S, K).    (14)

When the expressions for L_λ and L_λ̄ derived from (12) are substituted in (14), we obtain an expression, (15), whose coefficients are nonnegative. The innermost summation in (15) is formally identical to that used by Liporace in his proof; therefore, the properties which he demonstrated for his auxiliary function with respect to μ and U hold in our case as well, thus giving us (8) and (9). We may thus conclude that (5) is correct for 𝒯 defined by (6)-(9). Furthermore, the parameter separation made explicit in (12)-(15) allows us to apply the same algorithm to mixtures of strictly log concave densities and/or elliptically symmetric densities, as treated by Liporace in [6].
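For completeness, the standard argument behind the property of Q stated earlier (that Q(λ, λ̄) ≥ Q(λ, λ) implies L_λ̄(O) ≥ L_λ(O)), which the correspondence does not spell out, can be sketched as follows; it is the usual Baum-type bound obtained by applying Jensen's inequality to the distribution L_λ(O, S, K)/L_λ(O) over pairs (S, K):

    \log \frac{L_{\bar\lambda}(O)}{L_\lambda(O)}
      = \log \sum_{S, K} \frac{L_\lambda(O, S, K)}{L_\lambda(O)} \cdot \frac{L_{\bar\lambda}(O, S, K)}{L_\lambda(O, S, K)}
      \;\geq\; \sum_{S, K} \frac{L_\lambda(O, S, K)}{L_\lambda(O)} \log \frac{L_{\bar\lambda}(O, S, K)}{L_\lambda(O, S, K)}
      = \frac{Q(\lambda, \bar\lambda) - Q(\lambda, \lambda)}{L_\lambda(O)},

so that Q(λ, λ̄) ≥ Q(λ, λ) indeed forces L_λ̄(O) ≥ L_λ(O).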

DISCUSSION

In [6] Liporace notes that by setting a_ij = p_j, 1 ≤ j ≤ n, for all i, the special case of a single mixture can be treated. It is natural then to think of using a model with n clusters of m states, each with a single associated Gaussian density function, as a way of treating the Gaussian mixture problem considered here.

The transformation can be accomplished in the following way. First we expand the state space of our n-state model as shown in Fig. 1, in which we have added states j_1 through j_{m+1} for each state j in the original Markov chain. Associated with states j_1, j_2, ..., j_m are distinct Gaussian densities corresponding to the m terms of the jth Gaussian mixture in our initial formulation. The transitions exiting state j have probabilities equal to the corresponding mixture weights. State j_{m+1} is a distinguished state that is entered with probability 1 from the other new states, exits to state j with probability a_jj, and generates no observation in so doing. The transition matrix for this configuration can be written down by inspection.

Fig. 1. Equivalent m + 2 state configuration for each state with an m-term Gaussian mixture.

A large number of the entries in this matrix will be zero or unity. As these are unaltered by (6) and (7), they need not be reestimated. Using this reconfiguration of the state diagram, Liporace's formulas can be used in case b_j(x) is any mixture of elliptically symmetric densities.

A variant on the Gaussian mixture theme results from using b_j(x) of the form of a product of mixtures,

    b_j(x) = \prod_{i=1}^{D} \sum_{k=1}^{m} c_{jik}\, \mathcal{N}(x_i, \mu_{jik}, U_{jik}).    (16)

What we have considered so far is the special case of (16) for D = 1. From the structure of our derivation it is clear that for hidden Markov chains having densities of the form (16), reestimation formulas can be derived as before by solving ∇_λ̄ Q(λ, λ̄) = 0. Such solutions will yield results quite analogous to (6)-(10). Note that this case too can be represented as a reconfiguration of the state diagram.
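As an illustration of the product-of-mixtures form, here is a small sketch under our assumption that x is partitioned into D subvectors x_1, ..., x_D, each modeled by its own m-term Gaussian mixture; the function name and argument layout are ours, not the paper's.

```python
from scipy.stats import multivariate_normal

def product_of_mixtures_density(x_blocks, c_j, mus_j, Us_j):
    """Evaluate b_j(x) = prod_i sum_k c_jik * N(x_i, mu_jik, U_jik), cf. (16).

    x_blocks : length-D list of subvectors of x (the assumed partition)
    c_j      : (D, m) array of mixture weights, each row summing to one
    mus_j, Us_j : nested sequences with mus_j[i][k] the mean and
                  Us_j[i][k] the covariance of term k for subvector i
    """
    D, m = c_j.shape
    value = 1.0
    for i in range(D):
        value *= sum(c_j[i, k] * multivariate_normal.pdf(x_blocks[i],
                                                         mean=mus_j[i][k],
                                                         cov=Us_j[i][k])
                     for k in range(m))
    return value
```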

One numerical difficulty which may be manifest in the methods described is the phenomenon noted by Nadas [7], in which one or more of the mean vectors converge to a particular observation while the corresponding covariance matrix approaches a singular matrix. Under these conditions, L_λ(O) → ∞, but the value of λ is meaningless. A practical, if unedifying, remedy for this difficulty is to try a different initial λ. Alternatively, one can drop the offending term from the mixture, since it is only contributing at one point.

Finally, we call attention to two minor facets of these algorithms.

First, for flexibility in modeling, the number of terms in each mixture may vary with state, so that m in (1) could as well be m_j. A similar dependence on dimension results if m in (16) is replaced by m_ji. In either case, the constraints on the mixture weights must be satisfied.

Second, for realistic numbers of observations, for example, T ≥ 5000, the reestimation formulas will underflow on any existing computer. The basic scaling mechanism described in [5] can be used to alleviate the problem, but it must be modified to account for the fact that the ρ_t β_t product will be missing the tth scale factor. To divide out the product of scale factors, the tth summand in both numerator and denominator of (7), (9), and (10) must be multiplied by the missing coefficient.
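The scaling in question renormalizes the forward variables at every step and accumulates the scale factors, so that only log L_λ(O) is ever formed. The following is a minimal sketch of such a scaled forward pass under our own conventions (it is not reproduced from [5]); reinserting the missing tth scale factor into the summands of the reestimation formulas is exactly the modification described above and is not repeated here.

```python
import numpy as np

def scaled_forward(A, B):
    """Scaled version of recursion (2): alpha_hat_t sums to one at each t.

    Returns the scaled forward variables, the scale factors, and
    log L_lambda(O) recovered as the sum of their logarithms.
    """
    T, n = B.shape
    alpha_hat = np.zeros((T + 1, n))
    scale = np.ones(T + 1)
    alpha_hat[0, 0] = 1.0                     # alpha_0(1) = 1
    for t in range(T):
        a = (alpha_hat[t] @ A) * B[t]         # unscaled update, eq. (2)
        scale[t + 1] = a.sum()                # scale factor removed at this step
        alpha_hat[t + 1] = a / scale[t + 1]
    log_likelihood = np.log(scale[1:]).sum()
    return alpha_hat, scale, log_likelihood
```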

At this writing, numerical experiments based on Monte Carlo simulations and classification experiments using real speech signals are being conducted. We hope to report the results of these studies upon their completion.


REFERENCES

[1] L. E. Baum, "An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process," in Inequalities, vol. III, O. Shisha, Ed. New York: Academic, 1972, pp. 1-8.

[2] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," Ann. Math. Statist., vol. 41, pp. 164-171, 1970.

[3] L. E. Baum and G. R. Sell, "Growth transformations for functions on manifolds," Pac. J. Math., vol. 27, pp. 211-227, 1968.

[4] A. H. Gray and J. D. Markel, "Quantization and bit allocation in speech processing," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 459-473, Dec. 1976.

[5] S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, "An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition," Bell Syst. Tech. J., vol. 62, pp. 1035-1074, Apr. 1983.

[6] L. R. Liporace, "Maximum likelihood estimation for multivariate observations of Markov sources," IEEE Trans. Inform. Theory, vol. IT-28, pp. 729-734, Sept. 1982.

[7] A. Nadas, "Hidden Markov chains, the forward-backward algorithm, and initial statistics," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-31, pp. 504-506, Apr. 1983.

[8] L. R. Rabiner, J. G. Wilpon, and J. G. Ackenhusen, "On the effects of varying analysis parameters on an LPC-based isolated word recognizer," Bell Syst. Tech. J., vol. 60, pp. 893-911, 1981.

[9] H. W. Sorenson and D. L. Alspach, "Recursive Bayesian estimation using Gaussian sums," Automatica, vol. 7, pp. 465-479, 1971.

