
29. A Machine Learning Framework for Spoken-Dialog Classification

C. Cortes, P. Haffner, M. Mohri

One of the key tasks in the design of large-scale dialog systems is classification. This consists of assigning, out of a finite set, a specific category to each spoken utterance, based on the output of a speech recognizer. Classification in general is a standard machine-learning problem, but the objects to classify in this particular case are word lattices, or weighted automata, and not the fixed-size vectors for which learning algorithms were originally designed. This chapter presents a general kernel-based learning framework for the design of classification algorithms for weighted automata. It introduces a family of kernels, rational kernels, that combined with support vector machines form powerful techniques for spoken-dialog classification and other classification tasks in text and speech processing. It describes efficient algorithms for their computation and reports the results of their use in several difficult spoken-dialog classification tasks based on deployed systems. Our results show that rational kernels are easy to design and implement, and lead to substantial improvements of the classification accuracy. The chapter also provides some theoretical results helpful for the design of rational kernels.

29.1 Motivation
29.2 Introduction to Kernel Methods
29.3 Rational Kernels
29.4 Algorithms
29.5 Experiments
29.6 Theoretical Results for Rational Kernels
29.7 Conclusion
References

29.1 Motivation

A critical problem for the design of large-scale spoken-dialog systems is to assign a category, out of a finite set, to each spoken utterance. These categories help guide the dialog manager in formulating a response to the speaker. The choice of categories depends on the application: they could be, for example, referral or pre-certification for a health-care company dialog system, or billing services or credit for an operator-service system.


Fig. 29.1 Word lattice output of a speech recognition system for the spoken utterance “Hi, this is my number”

To determine the category of a spoken utterance, one needs to analyze the output of a speech recognizer. Figure 29.1 is taken from a customer-care application. It illustrates the output of a state-of-the-art speech recognizer in a very simple case where the spoken utterance is “Hi, this is my number.” The output is an acyclic weighted automaton called a word lattice. It compactly represents the recognizer's best guesses.


Each path is labeled with a sequence of words and has a score obtained by summing the weights of the constituent transitions. The path with the lowest score is the recognizer's best guess, in this case: I'd like my card number.

This example makes evident that the error rate of conversational speech recognition systems is still too high in many tasks to rely only on the one-best output of the recognizer. Instead, one can use the full word lattice, which contains the correct transcription in most cases. This is indeed the case in Fig. 29.1, since the top path is labeled with the correct sentence. Thus, in this chapter, spoken-dialog classification is formulated as the problem of assigning a category to each word lattice.
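To make the lattice representation concrete, the following minimal Python sketch enumerates the paths of a small hypothetical lattice; the states, words, and weights are illustrative only and are not those of Fig. 29.1. The weights play the role of negative log-probability scores, so the path with the lowest total score is the recognizer's best guess.

```python
# A toy word lattice: an acyclic weighted graph whose transition weights act as
# negative log-probability scores (lower total score = better hypothesis).
# transitions: state -> list of (next_state, word, weight)
lattice = {
    0: [(1, "hi", 0.4), (1, "I'd", 0.1)],
    1: [(2, "this", 0.4), (2, "like", 0.2)],
    2: [(3, "is", 0.4), (3, "my", 0.2)],
    3: [(4, "my", 0.4), (4, "card", 0.2)],
    4: [(5, "number", 0.1)],
}
final_state = 5

def all_paths(state, words=(), score=0.0):
    """Enumerate (sentence, score) for every successful path of the lattice."""
    if state == final_state:
        yield " ".join(words), score
        return
    for nxt, word, w in lattice[state]:
        yield from all_paths(nxt, words + (word,), score + w)

paths = sorted(all_paths(0), key=lambda p: p[1])
best_sentence, best_score = paths[0]
print("best guess:", best_sentence, "score:", round(best_score, 2))
for sentence, score in paths:
    print(round(score, 2), sentence)
```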

Classification in general is a standard machine-learning problem. A classification algorithm receives a finite number of labeled examples which it uses for training, and selects a hypothesis expected to make few errors on future examples. For the design of modern spoken-dialog systems, this training sample is often available. It is the result of careful human labeling of spoken utterances with a finite number of predetermined categories of the type already mentioned.

However, most classification algorithms were originally designed to classify fixed-size vectors. The objects to analyze for spoken-dialog classification are word lattices, each a collection of a large number of sentences with some weight or probability. How can standard classification algorithms such as support vector machines [29.1] be extended to handle such objects?

This chapter presents a general framework and solution for this problem, which is based on kernel methods [29.2, 3]. Thus, we shall start with a brief introduction to kernel methods (Sect. 29.2). Section 29.3 will then present a kernel framework, rational kernels, that is appropriate for word lattices and other weighted automata. Efficient algorithms for the computation of these kernels will be described in Sect. 29.4. We also report, in Sect. 29.5, the results of our experiments using these methods in several difficult large-vocabulary spoken-dialog classification tasks based on deployed systems. There are several theoretical results that can guide the design of kernels for spoken-dialog classification. These results are discussed in Sect. 29.6.

29.2 Introduction to Kernel Methods

Let us start with a very simple two-group classification problem, illustrated by Fig. 29.2, where one wishes to distinguish two populations, the blue and red circles. In this very simple example, one can choose a hyperplane to separate the two populations correctly, but there are infinitely many choices for the selection of that hyperplane. There is good theory, though, supporting the choice of the hyperplane that maximizes the margin, that is, the distance between each population and the separating hyperplane. Indeed, let F denote the class of real-valued functions on the ball of radius R in ℝ^N:

F = {x ↦ w · x : ‖w‖ ≤ 1, ‖x‖ ≤ R} .   (29.1)

Then, it can be shown [29.4] that there is a constant c such that, for all distributions D over X, with probability at least 1 − δ, if a classifier sgn(f), with f ∈ F, has margin at least ρ over m independently generated training examples, then the generalization error of sgn(f), or error on any future example, is no more than

(c/m) ( (R²/ρ²) log² m + log(1/δ) ) .   (29.2)

This bound justifies large-margin classification algorithms such as support vector machines (SVMs). Let w · x + b = 0 be the equation of the hyperplane, where w ∈ ℝ^N is a vector normal to the hyperplane and b ∈ ℝ a scalar offset. The classifier sgn(h) corresponding to this hyperplane is unique and can be defined with respect to the training points x_1, …, x_m:

h(x) = w · x + b = Σ_{i=1}^{m} α_i (x_i · x) + b ,   (29.3)

where the α_i are real-valued coefficients.


Fig. 29.2a,b Large-margin linear classification. (a) An arbitrary hyperplane can be chosen to separate the two groups. (b) The maximal-margin hyperplane provides better theoretical guarantees


Fig. 29.3a,b Nonlinearly separable case. The classification task consists of discriminating between the solid squares and solid circles. (a) No hyperplane can separate the two populations. (b) A nonlinear mapping can be used instead

The main point we are interested in here is that, both for the construction of the hypothesis and the later use of that hypothesis for classification of new examples, one needs only to compute a number of dot products between examples.

In practice, linear separation of the training data is often not possible. Figure 29.3a shows an example where any hyperplane crosses both populations. However, one can use more-complex functions to separate the two sets, as in Fig. 29.3b. One way to do that is to use a nonlinear mapping Φ : X → F from the input space X to a higher-dimensional space F where linear separation is possible.

The dimension of F can truly be very large in practice. For example, in the case of document classification, one may use as features sequences of three consecutive words (trigrams). Thus, with a vocabulary of just 100 000 words, the dimension of the feature space F is 10^15. On the positive side, as indicated by the error bound of (29.2), the generalization ability of large-margin classifiers such as SVMs does not depend on the dimension of the feature space but only on the margin ρ and the number of training examples m. However, taking a large number of dot products in a very high-dimensional space to define the hyperplane may be very costly.

A solution to this problem is to use the so-called kernel trick or kernel methods. The idea is to define a function K : X × X → ℝ, called a kernel, such that the kernel function on two examples x and y in input space, K(x, y), is equal to the dot product of the two examples Φ(x) and Φ(y) in feature space:

∀x, y ∈ X,  K(x, y) = Φ(x) · Φ(y) .   (29.4)

K is often viewed as a similarity measure. A crucial advantage of K is efficiency: there is no need anymore to define and explicitly compute Φ(x), Φ(y), and Φ(x) · Φ(y). Another benefit of K is flexibility: K can be arbitrarily chosen as long as the existence of Φ is guaranteed, which is called Mercer's condition. This condition is important to guarantee the convergence of training for algorithms such as SVMs. Some standard Mercer kernels over a vector space are the polynomial kernels of degree d ∈ ℕ, K_d(x, y) = (x · y + 1)^d, and Gaussian kernels K_σ(x, y) = exp(−‖x − y‖²/σ²), σ ∈ ℝ₊.

A condition equivalent to Mercer's condition is that the kernel K be positive definite and symmetric, that is, in the discrete case, the matrix [K(x_i, x_j)]_{1≤i,j≤n} must be symmetric and positive semidefinite for any choice of n points x_1, …, x_n in X. Said differently, the matrix must be symmetric and its eigenvalues nonnegative. Thus, for the problem that we are interested in, the question is how to define positive-definite symmetric kernels for word lattices or weighted automata.
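As a small numerical illustration of (29.4) and of this condition, the following sketch checks, for the degree-2 polynomial kernel on ℝ², that the kernel value coincides with an explicit feature-space dot product and that a small Gram matrix is symmetric with nonnegative eigenvalues. The example points are invented; only the kernel definitions come from the text.

```python
# Degree-2 polynomial kernel K(x, y) = (x . y + 1)^2 and its explicit
# 6-dimensional feature map on R^2: the kernel computes the same dot product
# without ever building Phi(x).
import math
import numpy as np

def poly2_kernel(x, y):
    return (np.dot(x, y) + 1.0) ** 2

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel on R^2."""
    x1, x2 = x
    s = math.sqrt(2.0)
    return np.array([x1 * x1, x2 * x2, s * x1 * x2, s * x1, s * x2, 1.0])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert abs(poly2_kernel(x, y) - np.dot(phi(x), phi(y))) < 1e-9   # (29.4)

# Mercer / PDS check on a few arbitrary points: the Gram matrix must be
# symmetric with nonnegative eigenvalues.
points = [np.array(p) for p in [(0.0, 0.0), (1.0, 1.0), (2.0, -1.0), (-1.0, 3.0)]]
gram = np.array([[poly2_kernel(a, b) for b in points] for a in points])
assert np.allclose(gram, gram.T)
assert np.all(np.linalg.eigvalsh(gram) >= -1e-9)
```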

29.3 Rational Kernels

This section introduces a family of kernels for weighted automata, rational kernels. We will start with some preliminary definitions of automata and transducers. For the most part, we will adopt the notation introduced in Chap. 28 for automata and transducers.



Fig. 29.4a,b Weighted automata and transducers. (a) Example of a weighted automaton A. [[A]](abb) = 0.1 × 0.2 × 0.3 × 0.1 + 0.5 × 0.3 × 0.6 × 0.1. (b) Example of a weighted transducer T. [[T]](abb, baa) = [[A]](abb)


Figure 29.4a shows a simple weighted automaton. It is a weighted directed graph in which edges, or transitions, are augmented with a label and carry some weight indicated after the slash symbol. A bold circle indicates an initial state and a double circle a final state. A final state may also carry a weight, indicated after the slash symbol following the state number.

A path from an initial state to a final state is called a successful path. The weight of a path is obtained by multiplying the weights of its constituent transitions and the final weight. The weight associated by A to a string x, denoted by [[A]](x), is obtained by summing the weights of all paths labeled with x. In this case, the weight associated to the string abb is the sum of the weights of two paths. The weight of each path is obtained by multiplying the transition weights and the final weight 0.1.

Similarly, Fig. 29.4b shows an example of a weighted transducer. Weighted transducers are similar to weighted automata, but each transition is augmented with an output label in addition to the familiar input label and the weight. The output label of each transition is indicated after the colon separator. The weight associated to a pair of strings (x, y), denoted by [[T]](x, y), is obtained by summing the weights of all paths with input label x and output label y. Thus, for example, the weight associated by the transducer T to the pair (abb, baa) is obtained as in the automaton case by summing the weights of the two paths labeled with (abb, baa).
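The following sketch computes [[A]](x) by enumerating the paths labeled with x. The transition structure below is hypothetical (the exact topology of Fig. 29.4a is not reproduced here); it is only chosen so that [[A]](abb) equals the value given in the figure caption.

```python
# A minimal weighted automaton over the (+, x) semiring. Two disjoint paths are
# labeled abb, so [[A]](abb) = 0.1*0.2*0.3*0.1 + 0.5*0.3*0.6*0.1, the value in
# the caption of Fig. 29.4a.
from collections import defaultdict

class WeightedAutomaton:
    def __init__(self, initial, final_weights, transitions):
        self.initial = initial                 # single initial state
        self.final_weights = final_weights     # state -> final weight
        self.arcs = defaultdict(list)          # state -> [(label, next_state, weight)]
        for src, label, dst, w in transitions:
            self.arcs[src].append((label, dst, w))

    def weight(self, string):
        """[[A]](x): sum over all successful paths labeled with x of the product
        of the transition weights and the final weight."""
        total = 0.0
        def visit(state, pos, w):
            nonlocal total
            if pos == len(string):
                if state in self.final_weights:
                    total += w * self.final_weights[state]
                return
            for label, dst, arc_w in self.arcs[state]:
                if label == string[pos]:
                    visit(dst, pos + 1, w * arc_w)
        visit(self.initial, 0, 1.0)
        return total

A = WeightedAutomaton(
    initial=0,
    final_weights={3: 0.1},
    transitions=[(0, "a", 1, 0.1), (1, "b", 2, 0.2), (2, "b", 3, 0.3),
                 (0, "a", 4, 0.5), (4, "b", 5, 0.3), (5, "b", 3, 0.6)],
)
print(A.weight("abb"))   # 0.1*0.2*0.3*0.1 + 0.5*0.3*0.6*0.1
```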

To help us gain some intuition into the definition of a family of kernels for weighted automata, let us first consider kernels for sequences. As mentioned earlier in Sect. 29.2, a kernel can be viewed as a similarity measure. In the case of sequences, we may say that two strings are similar if they share many common substrings or subsequences. The kernel could then be, for example, the sum of the products of the counts of these common substrings. But how can we generalize that to weighted automata?

Similarity measures such as the one just discussed can be computed by using weighted finite-state transducers. Thus, a natural idea is to use weighted transducers to define similarity measures for sequences. We will say that a kernel K is rational when there exists a weighted transducer T such that

K(x, y) = [[T]](x, y) ,   (29.5)

for all sequences x and y. This definition generalizes naturally to the case where the objects to handle are weighted automata. We say that K is rational when there exists a weighted transducer T such that K(A, B), the similarity measure of two automata A and B, is given by

K(A, B) = Σ_{x,y} [[A]](x) · [[T]](x, y) · [[B]](y) .   (29.6)

Here [[T]](x, y) is the similarity measure between two strings x and y. But, in addition, we need to take into account the weight associated by A to x and by B to y, and sum over all pairs (x, y). This definition can be generalized to the case of an arbitrary semiring, where general operations other than the usual sum and multiplication are applied, which has important theoretical and algorithmic consequences and also interesting software-engineering implications [29.5].
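A brute-force rendering of definition (29.6) may help fix the idea. The sketch below assumes that the automata and the transducer assign nonzero weight to only finitely many toy strings, so the sum over all pairs (x, y) can be enumerated directly; real word lattices are handled with the composition-based algorithm of Sect. 29.4.

```python
# Brute-force evaluation of K(A, B) = sum_{x,y} [[A]](x) * [[T]](x, y) * [[B]](y)
# for toy, explicitly enumerated weight functions (hypothetical values).
A = {"hi this": 0.7, "I this": 0.3}        # [[A]](x)
B = {"hi this": 0.6, "hi miss": 0.4}       # [[B]](y)
T = {                                      # [[T]](x, y): a toy similarity transducer
    ("hi this", "hi this"): 2.0,
    ("hi this", "hi miss"): 1.0,
    ("I this", "hi this"): 1.0,
}

def rational_kernel(A, B, T):
    """K(A, B) = sum over all pairs (x, y) of [[A]](x) * [[T]](x, y) * [[B]](y)."""
    return sum(A[x] * w * B[y] for (x, y), w in T.items() if x in A and y in B)

print(rational_kernel(A, B, T))   # 0.7*2*0.6 + 0.7*1*0.4 + 0.3*1*0.6 = 1.3
```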

Our definition of kernels for weighted automata is a first step towards extending SVMs to handle weighted automata. However, we also need to provide efficient algorithms for computing the kernels. The weighted automata that we are dealing with in spoken-dialog systems are word lattices that may have hundreds of thousands of states and transitions, and millions of paths, so the efficiency of the computation is critical. In particular, the computational cost of applying existing string-kernel algorithms to each pair of paths of two automata A and B is clearly prohibitive.

We also have to provide effective kernels for spoken-dialog classification and prove that they are indeed positive-definite symmetric, so that they can be combined with SVMs for high-accuracy classification systems.


29.4 Algorithms

This section describes general, efficient algorithms for the computation of rational kernels between weighted automata and provides examples of effective positive-definite symmetric kernels for spoken-dialog classification.

The outline of the algorithm is as follows. The main observation is that the sum defining rational kernels can be viewed as the sum of the weights of all the paths of a single transducer, obtained by composing A with T with B (see Chap. 28):

Σ_{x,y} [[A]](x) [[T]](x, y) [[B]](y) = Σ_{x,y} [[A ◦ T ◦ B]](x, y) .   (29.7)

The sum of the weights of all the paths of a transducer can be computed using a general single-source shortest-distance algorithm over the ordinary (+, ×) semiring, or the so-called forward–backward algorithm in the acyclic case [29.6]. This leads to the following general algorithm to compute K(A, B) when K is a rational kernel associated to the transducer T.

• Use the composition algorithm (see Chap. 28) to compute U = A ◦ T ◦ B in time O(|T| |A| |B|).
• Use a general single-source shortest-distance algorithm to compute the sum of the weights of all successful paths of U [29.6]. This can be done in linear time, O(|U|), when U is acyclic, which is the case when A and B are acyclic automata (word lattices).

In combination, this provides a general, efficient algorithm for computing rational kernels whose total complexity for acyclic word lattices is quadratic: O(|T| |A| |B|). Here |T| is a constant independent of the automata A and B for which K(A, B) needs to be computed.
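The second step is the only one sketched below, under the assumption that U = A ◦ T ◦ B has already been computed and is acyclic: the sum of the weights of all successful paths is obtained in linear time by a forward pass in topological order, i.e., the single-source shortest-distance algorithm over the (+, ×) semiring. The toy machine at the end is hypothetical.

```python
# Forward (shortest-distance) pass over an acyclic weighted machine U:
# returns the sum of the weights of all successful paths.
from collections import defaultdict

def path_weight_sum(num_states, initial, finals, arcs):
    """arcs: list of (src, dst, weight); finals: state -> final weight."""
    indeg = defaultdict(int)
    out = defaultdict(list)
    for src, dst, w in arcs:
        out[src].append((dst, w))
        indeg[dst] += 1

    # Kahn topological order (valid because U is acyclic).
    order = [s for s in range(num_states) if indeg[s] == 0]
    for s in order:                      # the list grows while we iterate
        for dst, _ in out[s]:
            indeg[dst] -= 1
            if indeg[dst] == 0:
                order.append(dst)

    # forward[s] = sum of the weights of all paths from the initial state to s
    forward = defaultdict(float)
    forward[initial] = 1.0
    for s in order:
        for dst, w in out[s]:
            forward[dst] += forward[s] * w
    return sum(forward[s] * fw for s, fw in finals.items())

# Toy acyclic machine with two successful paths, 0->1->3 and 0->2->3.
print(path_weight_sum(4, 0, {3: 1.0},
                      [(0, 1, 0.5), (0, 2, 0.5), (1, 3, 0.2), (2, 3, 0.8)]))   # 0.5
```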


Fig. 29.5a–c Weighted transducers computing the expected counts of (a) all sequences accepted by the regular expression X; (b) all bigrams; (c) all gappy bigrams with gap penalty λ, over the alphabet {a, b}

The next question that arises is how to define and compute kernels based on counts of substrings or subsequences, as previously discussed in Sect. 29.3. For weighted automata, we need to generalize the notion of counts to take into account the weight of the paths in each automaton and use instead the expected counts of a sequence.

Let |u|_x denote the number of occurrences of a substring x in u. Since a weighted automaton A defines a distribution over the set of strings u, the expected count of a sequence x in a weighted automaton A can be defined naturally as

c_A(x) = Σ_{u ∈ Σ*} |u|_x [[A]](u) ,   (29.8)

where the sum runs over Σ*, the set of all strings over the alphabet Σ. We mentioned earlier that weighted transducers can often be used to count the occurrences of some substrings of interest in a string. Let us assume that one could find a transducer T that could compute the expected counts of all substrings of interest appearing in a weighted automaton A. Then A ◦ T would provide the expected counts of these substrings in A. Similarly, T⁻¹ ◦ B would provide the expected counts of these substrings in B. Recall that T⁻¹ is the transducer obtained by swapping the input and output labels of each transition (Chap. 28). Composition of A ◦ T and T⁻¹ ◦ B matches paths labeled with the same substring in A ◦ T and T⁻¹ ◦ B and multiplies their weights. Thus, by definition of composition (see Chap. 28), we could compute the sum of the expected counts in A and B of the matching substrings by computing

A ◦ T ◦ T⁻¹ ◦ B ,   (29.9)

and summing the weights (expected counts) of all paths using a general single-source shortest-distance algorithm.



Fig. 29.6 Simple weighted transducer corresponding to the gappy bigram kernel used by [29.7], obtained by composition of the transducer of Fig. 29.5c and its inverse

Comparing this formula with (29.7) shows that the count-based similarity measure we are interested in is the rational kernel whose corresponding weighted transducer S is

S = T ◦ T⁻¹ .   (29.10)

Thus, if we can determine a T with this property, this would help us naturally define the similarity measure S. In Sect. 29.6, we will prove that the kernel corresponding to S is actually positive-definite symmetric, and can hence be used in combination with SVMs.

In the following, we will provide a weighted transducer T for computing the expected counts of substrings. It turns out that there exists a very simple transducer T that can be used for that purpose. Figure 29.5a shows a simple transducer that can be used to compute the expected counts of all sequences accepted by the regular expression X over the alphabet {a, b}. In the figure, the transition labeled with X : X/1 symbolizes a finite automaton representing X, with identical input and output labels and all weights equal to one.

Here is how the transducer T counts the occurrences of a substring x recognized by X in a sequence u. State 0 reads a prefix u_1 of u and outputs the empty string ε. Then, an occurrence of x is read and output identically. Then state 1 reads a suffix u_2 of u and outputs ε. In how many different ways can u be decomposed into u = u_1 x u_2? In exactly as many ways as there are occurrences of x in u. Thus, T applied to u generates exactly |u|_x successful paths labeled with x. This holds for all strings x accepted by X. If T is applied to (or composed with) a weighted automaton A instead, then for any x it generates |u|_x paths for each path of A labeled with u. Furthermore, by definition of composition, each path generated is weighted with the weight of the path in A labeled with u. Thus, the sum of the weights of the generated paths labeled with x is exactly the expected count of x:

[[A ◦ T]](x) = c_A(x) .   (29.11)

Thus, there exists a simple weighted transducer T that can compute, as desired, the expected counts of all strings recognized by an arbitrary regular expression X in a weighted automaton A. In particular, since the set of bigrams over an alphabet Σ is a regular language, we can construct a weighted transducer T computing the expected counts of all bigrams. Figure 29.5b shows that transducer, which has only three states regardless of the alphabet size.

In some applications, one may wish to allow for a gap between the occurrences of two symbols and view two weighted automata as similar if they share such substrings (gappy bigrams) with relatively large expected counts. The gap, or distance, between the two symbols is penalized using a fixed penalty factor λ, 0 ≤ λ < 1. A sequence kernel based on these ideas was used successfully by [29.7] for text categorization. Interestingly, constructing a kernel based on gappy bigrams is straightforward in our framework. Figure 29.5c shows the transducer T counting the expected counts of gappy bigrams, from which the kernel can be efficiently constructed. The same can be done similarly for higher-order gappy n-grams or other gappy substrings.
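A brute-force sketch of gappy bigram counts for a single, unweighted string may help fix the idea. The decay factor λ and the token-level counting convention below are the only assumptions; the transducer of Fig. 29.5c computes the corresponding expected counts directly on a weighted automaton.

```python
# Gappy bigram counts with decay lambda_: a pair of symbols at positions i < j
# contributes lambda_ ** (j - i - 1), so adjacent symbols count fully and
# larger gaps are penalized.
from collections import defaultdict

def gappy_bigram_counts(tokens, lambda_):
    counts = defaultdict(float)
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            counts[(tokens[i], tokens[j])] += lambda_ ** (j - i - 1)
    return dict(counts)

print(gappy_bigram_counts("a b a".split(), 0.5))
# {('a', 'b'): 1.0, ('a', 'a'): 0.5, ('b', 'a'): 1.0}
```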

The methods presented in this section can be used to construct efficiently, and often in a simple manner, relatively complex weighted-automata kernels K based on expected counts or other ideas. A single, general algorithm can then be used as described to compute K(A, B) efficiently for any two weighted automata A and B, without the need to design a new algorithm for each new kernel, as previously done in the literature. As an example, the gappy kernel used by [29.7] is the rational kernel corresponding to the six-state transducer S = T ◦ T⁻¹ of Fig. 29.6.


29.5 Experiments

This section describes the applications of the kernel framework and techniques to spoken-dialog classification.

In most of our experiments, we used simple n-gram rational kernels. An n-gram kernel k_n for two weighted automata or lattices A and B is defined by

k_n(A, B) = Σ_{|x|=n} c_A(x) c_B(x) .   (29.12)

As described in Sect. 29.4, k_n is a kernel of the form T ◦ T⁻¹ and can be computed efficiently. An n-gram rational kernel K_n is simply the sum of the kernels k_m, with 1 ≤ m ≤ n:

K_n = Σ_{m=1}^{n} k_m .

Thus, the feature space associated with K_n is the set of all m-gram sequences with m ≤ n. As discussed in the previous section, it is straightforward, using the same algorithms and representations, to extend these kernels to kernels with gaps and to many other more-complex rational kernels more closely adapted to the applications considered.
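A brute-force sketch of (29.12) and of K_n follows, again assuming each lattice is given as an explicit distribution over finitely many sentences; the transducer-based computation of Sect. 29.4 gives the same values without enumerating sentences. The two toy lattices are invented.

```python
# n-gram kernels k_n(A, B) = sum_{|x|=n} c_A(x) c_B(x) and K_n = k_1 + ... + k_n,
# computed by direct enumeration of expected n-gram counts.
from collections import defaultdict

def expected_ngram_counts(lattice, n):
    """c_A(x) for every n-gram x, with lattice = {sentence: weight}."""
    counts = defaultdict(float)
    for sentence, weight in lattice.items():
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += weight
    return counts

def k_n(A, B, n):
    cA, cB = expected_ngram_counts(A, n), expected_ngram_counts(B, n)
    return sum(cA[x] * cB[x] for x in cA if x in cB)

def K_n(A, B, n):
    return sum(k_n(A, B, m) for m in range(1, n + 1))

A = {"hi this is my number": 0.6, "I'd like my card number": 0.4}
B = {"this is my number": 1.0}
print(K_n(A, B, 3))
```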

We carried out a series of experiments in several large-vocabulary spoken-dialog tasks using rational kernels, with a twofold objective [29.8]: to improve classification accuracy in those tasks, and to evaluate the impact on classification accuracy of the use of a word lattice rather than the one-best output of the automatic speech recognition (ASR) system.

The first task we considered was that of a deployed customer-care application (How may I help you?, HMIHY 0300). In this task, users interact with a spoken-dialog system via the telephone, speaking naturally, to ask about their bills, their calling plans, or other similar topics. Their responses to the open-ended prompts of the system are not constrained by the system; they may be any natural-language sequence. The objective of spoken-dialog classification is to assign one or several categories, or call-types, e.g., billing credit or calling plans, to the users' spoken utterances. The set of categories is predetermined, and in this first application there are 64 categories. The calls are classified based on the user's response to the first greeting prompt: “Hello, this is AT&T. How may I help you?”

Table 29.1 indicates the size of the HMIHY 0300 datasets we used for training and testing. The training set is relatively large, with more than 35 000 utterances;


Fig. 29.7a–c Classification error rate as a function of rejection rate in (a) HMIHY 0300, (b) VoiceTone1, and (c) VoiceTone2

this is an extension of the one we used in our previous classification experiments with HMIHY 0300 [29.9]. In our experiments, we used the n-gram rational kernels


Table 29.1 Key characteristics of the three datasets used in the experiments. The fifth column displays the total number of unigrams, bigrams, and trigrams found in the one-best output of the ASR for the utterances of the training set, that is, the number of features used by BoosTexter or by SVMs applied to the one-best outputs. The training and testing sizes reported in columns 3 and 4 are given in numbers of utterances, or equivalently numbers of word lattices

Dataset        Number of classes   Training size   Testing size   Number of n-grams   ASR word accuracy (%)
HMIHY 0300     64                  35551           5000           24177               72.5
VoiceTone1     97                  29561           5537           22007               70.5
VoiceTone2     82                  9093            5172           8689                68.8

just described with n = 3. Thus, the feature set we used was that of all n-grams with n ≤ 3. Table 29.1 indicates the total number of distinct features of this type found in the datasets. The word accuracy of the system based on the best hypothesis of the speech recognizer was 72.5%. This motivated our use of the word lattices, which contain the correct transcription in most cases. The average number of transitions of a word lattice in this task was about 260.

Table 29.1 reports similar information for two other datasets, VoiceTone1 and VoiceTone2. These are more-recently deployed spoken-dialog systems in different areas; e.g., VoiceTone1 is a task where users interact with a system related to health care, with a larger set of categories (97). The size of the VoiceTone1 datasets we used and the word accuracy of the recognizer (70.5%) make this task otherwise similar to HMIHY 0300. The datasets provided for VoiceTone2 are significantly smaller, with a higher word error rate. The word error rate is indicative of the difficulty of the classification task, since a higher error rate implies a noisier input. The average number of transitions of a word lattice was about 210 in VoiceTone1 and about 360 in VoiceTone2.

Each utterance of the dataset may be labeled with several classes. The evaluation is based on the following criterion: it is considered an error if the highest scoring class given by the classifier is none of these labels.

We used the AT&T FSM library [29.10] and the GRM library [29.11] for the implementation of the n-gram rational kernels K_n. We used these kernels with SVMs, using a general learning library for large-margin classification (LLAMA), which offers an optimized multiclass recombination of binary SVMs [29.12]. Training took a few hours on a single 2.4 GHz Intel Pentium processor of a Linux cluster with 2 GB of memory and 512 KB cache.

In our experiments, we used the trigram kernel K_3 with a second-degree polynomial. Preliminary experiments showed that the top performance was reached for trigram kernels and that 4-gram kernels, K_4, did not significantly improve performance. We also found that the combination of a second-degree polynomial kernel with the trigram kernel significantly improves performance over a linear classifier, but that no further improvement could be obtained with a third-degree polynomial.
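The authors used the LLAMA library for the SVM training itself; the following hedged sketch only illustrates the general setup with scikit-learn's support for precomputed kernels. Here base_kernel stands for any routine returning the trigram rational kernel value between two lattices (for instance, the toy K_n sketch shown earlier with n = 3), and the second-degree polynomial is applied on top of it, as in the experiments.

```python
# Illustration only: a precomputed Gram matrix of (K3 + 1)^2 plugged into an
# SVM with kernel="precomputed". This is not the LLAMA setup of the chapter.
import numpy as np
from sklearn.svm import SVC

def gram_matrix(lattices_a, lattices_b, base_kernel):
    """Gram matrix of the kernel (base_kernel(., .) + 1)^2 between two collections."""
    return np.array([[(base_kernel(a, b) + 1.0) ** 2 for b in lattices_b]
                     for a in lattices_a])

def train_and_predict(train_lattices, labels, test_lattices, base_kernel):
    gram_train = gram_matrix(train_lattices, train_lattices, base_kernel)
    clf = SVC(kernel="precomputed").fit(gram_train, labels)
    gram_test = gram_matrix(test_lattices, train_lattices, base_kernel)
    return clf.predict(gram_test)
```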

We used the same kernels in the three datasets previously described and applied them both to the speech recognizer's single best hypothesis (one-best results) and to the full word lattices output by the speech recognizer. We also ran, for the sake of comparison, the BoosTexter algorithm [29.13] on the same datasets, applying it to the one-best hypothesis. This served as a baseline for our experiments.

Figure 29.7a shows the result of our experiments in the HMIHY 0300 task. It gives the classification error rate as a function of the rejection rate (utterances for which the top score is lower than a given threshold are rejected) in HMIHY 0300 for: BoosTexter, SVM combined with our kernels applied to the one-best hypothesis, and SVM combined with kernels applied to the full lattices.

SVM with trigram kernels applied to the one-best hypothesis leads to better classification than BoosTexter everywhere in the range of 0–40% rejection rate. The accuracy is about 2–3% absolute value better than that of BoosTexter in the range of interest for this task, which is roughly between 20% and 40% rejection rate. The results also show that the classification accuracy of SVMs combined with trigram kernels applied to word lattices is consistently better than that of SVMs applied to the one-best alone, by about 1% absolute value.

Figures 29.7b,c show the results of our experiments in the VoiceTone1 and VoiceTone2 tasks using the same techniques and comparisons. As observed previously, in many regards VoiceTone1 is similar to the HMIHY 0300 task, and our results for VoiceTone1 are comparable to those for HMIHY 0300. The results show that the classification accuracy of SVMs combined with trigram kernels applied to word lattices is consistently better than that of BoosTexter, by more than 4% absolute value at a rejection rate of about 20%. They also demonstrate



Fig. 29.8 Weighted transducer T^m for computing the m-th moment of the count of an aperiodic substring x. The final weight at state k, k = 1, …, m, indicated after the slash, is S(m, k), the Stirling number of the second kind, that is, the number of ways of partitioning a set of m elements into k nonempty subsets: S(m, k) = (1/k!) Σ_{i=0}^{k} (k choose i) (−1)^i (k − i)^m. The first-order transducer T^1 coincides with the transducer of Fig. 29.5a, since S(1, 1) = 1

more clearly the benefits of the use of the word lattices for classification in this task. This advantage is even more manifest for the VoiceTone2 task, for which the speech recognition accuracy is lower. VoiceTone2 is also a harder classification task, as can be seen by comparing the plots of Figs. 29.7b and 29.7c. The classification accuracy of SVMs with kernels applied to lattices is more than 6% absolute value better than that of BoosTexter near the 40% rejection rate, and about 3% better than SVMs applied to the one-best hypothesis.

Thus, our experiments in spoken-dialog classification in three distinct large-vocabulary tasks demonstrated that using rational kernels with SVMs consistently leads to very competitive classifiers. They also show that their application to the full word lattices, instead of the single best hypothesis output by the recognizer, systematically improves classification accuracy.

We further explored the use of kernels based on other moments of the counts of substrings in sequences, generalizing n-gram kernels [29.14]. Let m be a positive integer and let c^m_A(x) denote the m-th moment of the count of the sequence x in A, defined by

c^m_A(x) = Σ_{u ∈ Σ*} |u|_x^m [[A]](u) .   (29.13)

We can define a general family of kernels, denoted by K^m_n, n, m ≥ 1, and defined by

K^m_n(A, B) = Σ_{|x|=n} c^m_A(x) c^m_B(x) ,   (29.14)

which exploit the m-th moments of the counts of substrings x in the weighted automata A and B to define their similarity. Cortes and Mohri [29.14] showed that there exist weighted transducers T^m_n that can be used to compute c^m_A(x) efficiently for any n-gram sequence x and weighted automaton A. Thus, these kernels are rational kernels and their associated transducers are T^m_n ◦ (T^m_n)⁻¹:

K^m_n(A, B) = Σ_{x,y} [[A ◦ [T^m_n ◦ (T^m_n)⁻¹] ◦ B]](x, y) .   (29.15)

Figure 29.8 shows the weighted transducer T^m for an aperiodic string x, which has only m|x| + 1 states. An aperiodic string is a string that does not admit a nonempty prefix as a suffix. The m-th moments of the counts of other strings (periodic strings) can also be computed using weighted transducers. By (29.15), the transducer T^m_n can be used to compute K^m_n(A, B) by first computing the composed transducer A ◦ [T^m_n ◦ (T^m_n)⁻¹] ◦ B and then summing the weights of all the paths of this transducer using a shortest-distance algorithm [29.6].

The application of moment kernels to the HMIHY 0300 task resulted in a further improvement of the classification accuracy. In particular, at a 15% rejection rate, using variance kernels, that is, moment kernels of second order (m = 2), reduced the error rate by 1% absolute, or about 6.2% relative, which is significant in this task.

29.6 Theoretical Results for Rational Kernels

In the previous sections, we introduced a number of rational kernels, e.g., kernels based on expected counts or moments of the counts, and applied them to spoken-dialog tasks. Several questions arise in relation to these kernels. As pointed out earlier, to guarantee the convergence of algorithms such as SVMs, the kernels used


must be positive-definite symmetric (PDS). But how can we construct PDS rational kernels? Are n-gram kernels and similar kernels PDS? Can we combine simpler PDS rational kernels to create more-complex ones? Is there a characterization of PDS rational kernels?

All these questions have been investigated by Cortes et al. [29.5]. The following theorem provides a general method for constructing PDS rational kernels.

Theorem 29.1
Let T be a weighted finite-state transducer over (+, ×). Assume that the weighted transducer T ◦ T⁻¹ is regulated; then S = T ◦ T⁻¹ defines a PDS rational kernel. A weighted transducer T is said to be regulated when [[T]](x, y), the sum of the weights of the paths with input x and output y, is well defined.

Proof. We give a sketch of the proof; a full proof is given in [29.5]. Let K be the kernel associated with S = T ◦ T⁻¹. By definition of T⁻¹ and composition,

∀x, y ∈ X,  K(x, y) = Σ_z [[T]](x, z) [[T]](y, z) ,   (29.16)

where the sum is over all strings z. Let K_k be the function defined by restricting the sum to strings of length at most k:

∀x, y ∈ X,  K_k(x, y) = Σ_{|z| ≤ k} [[T]](x, z) [[T]](y, z) .   (29.17)

Consider any ordering of all strings of length at most k: z_1, …, z_l. For any set of n strings x_1, …, x_n, let A be the matrix defined by A = [[[T]](x_i, z_j)]_{i ∈ [1,n], j ∈ [1,l]}. Then, the eigenvalues of the matrix M_k defined by

M_k = [K_k(x_i, x_j)]_{i,j ∈ [1,n]}   (29.18)

are necessarily nonnegative, since M_k = A Aᵀ. Thus, for any k ≥ 0, K_k is a PDS kernel. Since K is the pointwise limit of the K_k, K is also PDS [29.5, 15]. □
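The following numerical sketch mirrors this argument on a toy table of transducer weights (the strings and weights are invented): the finite Gram matrix equals A Aᵀ and therefore has nonnegative eigenvalues.

```python
# Numerical check of the proof's key step: M_k = A A^T is symmetric and
# positive semidefinite, where A_ij = [[T]](x_i, z_j) for a toy transducer table.
import numpy as np

xs = ["hi", "hello", "bye"]          # hypothetical input strings x_1, ..., x_n
zs = ["h", "b", "e"]                 # hypothetical output strings z_1, ..., z_l
A = np.array([[1.0, 0.0, 0.0],       # A = [[[T]](x_i, z_j)]
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.2]])

M = A @ A.T                          # M_k = [K_k(x_i, x_j)]
assert np.allclose(M, M.T)
assert np.all(np.linalg.eigvalsh(M) >= -1e-12)
print(np.linalg.eigvalsh(M))         # all eigenvalues are nonnegative
```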

The theorem shows that the rational kernels we considered in previous sections, e.g., count-based similarity kernels, n-gram kernels, and gappy n-gram kernels, are all PDS rational kernels, which justifies a posteriori their use in combination with SVMs. Conversely, we have conjectured elsewhere that all PDS rational kernels are rational kernels associated to transducers of the type S = T ◦ T⁻¹ [29.5], and proved several results in support of that conjecture. In particular, for acyclic transducers, this indeed provides a characterization of PDS rational kernels.

It can also be shown that a finite sum of PDS rational kernels is a PDS rational kernel, which we used for defining the n-gram kernels. More generally, the following theorem holds [29.5].

Theorem 29.2
PDS rational kernels are closed under sum, product, and Kleene closure.

Thus, one can use rational operations to create complex PDS rational kernels from simpler ones.
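As a small numerical illustration of the sum and product cases (using toy Gram matrices, not rational kernels per se), the entrywise sum and product of two positive semidefinite Gram matrices on the same points remain positive semidefinite:

```python
# Sum and (entrywise) product of two PDS Gram matrices remain PDS,
# checked numerically on randomly generated toy matrices.
import numpy as np

rng = np.random.default_rng(0)
F1, F2 = rng.normal(size=(4, 3)), rng.normal(size=(4, 5))
K1, K2 = F1 @ F1.T, F2 @ F2.T        # two PDS Gram matrices on the same 4 points

for K in (K1 + K2, K1 * K2):         # sum kernel and product kernel
    assert np.all(np.linalg.eigvalsh(K) >= -1e-9)
print("sum and product of PDS kernels are PDS on these points")
```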

29.7 Conclusion

Rational kernels form an effective tool and framework for spoken-dialog classification. They are based on a general theory that guarantees, in particular, the positive definiteness of rational kernels based on an arbitrary (+, ×)-weighted transducer and thus the convergence of training for algorithms such as SVMs. General, efficient algorithms can be readily used for their computation.

Experiments in several large-vocabulary spoken-dialog tasks show that rational kernels can be combined with SVMs to form powerful classifiers and that they perform well in several difficult tasks. They also demonstrate the benefits of the use of kernels applied to word lattices.

Rational kernels form a rich family of kernels. The kernels used in the experiments we described are only special instances of this general class of kernels. Rational kernels adapted to a specific spoken-dialog task can be designed. In fact, it is often straightforward to encode prior knowledge about a task in the transducer defining these kernels. One may, for example, exclude some word sequences or regular expressions from the similarity measure defined by these kernels, or emphasize the importance of others by increasing their corresponding weight in the weighted transducer.


References

29.1 C. Cortes, V.N. Vapnik: Support-vector networks, Machine Learning 20(3), 273–297 (1995)

29.2 B.E. Boser, I. Guyon, V.N. Vapnik: A training algorithm for optimal margin classifiers. In: Proc. 5th Annual Workshop on Computational Learning Theory, Vol. 5 (ACM, Pittsburgh 1992) pp. 144–152

29.3 B. Schölkopf, A. Smola: Learning with Kernels (MIT Press, Cambridge 2002)

29.4 P. Bartlett, J. Shawe-Taylor: Generalization performance of support vector machines and other pattern classifiers. In: Advances in Kernel Methods: Support Vector Learning (MIT Press, Cambridge 1999) pp. 43–54

29.5 C. Cortes, P. Haffner, M. Mohri: Positive definite rational kernels. In: Proc. 16th Annual Conference on Computational Learning Theory (COLT 2003), LNCS 2777 (2003) pp. 41–56

29.6 M. Mohri: Semiring frameworks and algorithms for shortest-distance problems, J. Automata Lang. Combinat. 7(3), 321–350 (2002)

29.7 H. Lodhi, J. Shawe-Taylor, N. Cristianini, C. Watkins: Text classification using string kernels. In: Advances in Neural Information Processing Systems (NIPS 2000) (MIT Press, Boston 2001) pp. 563–569

29.8 C. Cortes, P. Haffner, M. Mohri: Rational kernels: Theory and algorithms, J. Machine Learning Res. (JMLR) 5, 1035–1062 (2004)

29.9 C. Cortes, P. Haffner, M. Mohri: Rational kernels. In: Advances in Neural Information Processing Systems (NIPS 2002), Vol. 15 (MIT Press, Boston 2002)

29.10 M. Mohri, F.C.N. Pereira, M. Riley: The design principles of a weighted finite-state transducer library, Theoret. Comput. Sci. 231, 17–32 (2000)

29.11 C. Allauzen, M. Mohri, B. Roark: A general weighted grammar library. In: Proc. Ninth International Conference on Implementation and Application of Automata (CIAA 2004) (Kingston 2004)

29.12 P. Haffner, G. Tur, J. Wright: Optimizing SVMs for complex call classification. In: Proc. ICASSP 2003 (2003)

29.13 R.E. Schapire, Y. Singer: BoosTexter: A boosting-based system for text categorization, Machine Learning 39(2/3), 135–168 (2000)

29.14 C. Cortes, M. Mohri: Moment kernels for regular distributions, Machine Learning 60(1–3), 117–134 (2005)

29.15 C. Berg, J.P.R. Christensen, P. Ressel: Harmonic Analysis on Semigroups (Springer, Berlin, Heidelberg 1984)
