Study on Keyword Spotting using Prosodic Attribute ...IÖ µ#´F, =Ôaug 5a.·p»6±!ÑDÐP + , /Ï ã'¢gã ãP¥aw ãPOS Jñp»!®78 ² 6±HMM ã

Study on Keyword Spotting using Prosodic Attribute Detection for

Conversational SpeechYu-Jui Huang

Department of Computer Science and Information EngineeringNational Chia-Yi [email protected]

Yin-Wei Chung

Department of Computer Science and Information EngineeringNational Chia-Yi [email protected]

Jui-Feng Yeh

Department of Computer Science and Information EngineeringNational Chia-Yi University

[email protected]

(SVM)

SVM

Abstract

It is one of most essential issues to extract the keywords from conversational speech for understanding the utterances from speakers. This thesis aims at keyword spotting from spontaneous speech for keyword detecting. We proposed prosodic features that are used for keyword detection. The prosody words are segmented from speaker’s utterance according to the pre-training decision tree. The supported vector machine is further used as the classifier to judge the prosody word is keyword or not. The prosody word boundary segmentation algorithm based on decision tree is illustrated. Besides the data driven feature, the knowledge obtained from the corpus observation is integrated in the decision tree. Finally, the keyword

Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)

231

in the focus part are extracted using prosody features by sported vector machine (SVM). According to the experimental results, we can find the proposed method outperform the phone verification approach especially in recall and accuracy. This shows the proposed approach is operative for keyword detecting.

Keywords: Keyword spotting, prosodic feature, prosody word, spoken language.

(Keyword spotting)

( )(Spontaneous speech)

(Dialogue system) (Speaking style)(Grammar)

(Real time)Kawahara (Keyword extraction)

(Verification)(Key-phrase detection) (Key-phrase verification) (Sentence parsing)

(sentence verification)(Incremental understanding) [1]

Charpter [2]

(Spoken Language Understanding, SLU)

(Knowledge based) [3](Prosodic attribute)

(Hierarchical Prosodic Phrase Grouping, HPG)[4][5] (Prosodic word)

Ali[1] Wieland

Bi-gram Beam-search Viterbi[6] Bitar

HMM[7] Rabiner 1989


232

[8] Tatsuya Kawahara Chin-Hui Lee Key-Phrase Detection Verification

[9]

Rose[10] HMM

(filler) Zhang[11]

Bahi[12]

HMMBazzi

HMM [13]

Lee C.H.[14]Kim[15]

[16][17]

Haizhou Li, Bin Ma, and Chin-Hui Lee [18]

AuToBi [19]POS HMM

Conkie [20]delta HMM

Sridhar[21] HMMHMM

Erteschik-shir [22]

[23]

[24][25]

[26]MFCC


233

[27] SVM

HPGFujisaki Model

[4][5][28][29][30] HPG

1

1:

(Prosodic Attributes Extraction)(Pitch) (Intensity) (Duration)

(HPG)(Prosodic Word Boundary) (Boundary Decision Tree)


234

SVM(Keyword Detector)

(Prosodic Word Detection)(Keyword Detection)

(syllable) (prosodic word)(intonation phrase)

(Hierarchical Prosodic Phrase Grouping, HPG)[4][5]

(syllable, Syl)(prosodic word, PW) (prosodic phrase, PPh) (breath-group)

(prosodic phrase group, PG)B1 B2 B3 B4 B5

B5 B1

B2

2 B

2 (Prosody word)


235

[4][5]

3 9

>

Case 1

Case 2

Case 3

Case 4 Case 5 Case 6 Case 7 Case 8 Case 9

(Pause)

(Pitch Reset) (Pitch Reset)

3 HPG

9(Pitch reset)

1

1 HPG

Case 1 >

Case 2

Case 3

Case 4 (Pitch reset)


236

Case 5

Case 6

Case 7

Case 8 (Pitch reset)

Case 9

(1) =0.040.03 0.05

0.04

(2)( 1)

(slope) i

( )i i iP t t� �� ( 1)

Pi(t) i t i i biei i 2

2

( )( ( ) ), [ , ]

( )

i

i

i

i

e

i it b

i i ie

t b

t t P t Pt b e

t t� �

�

� ��

�

�

�( 2)

t 3 iP i4 n

1 ( )2 i it e b� � ( 3)

1 ( )i

i

e

i it b

P P tn �

� � ( 4)


237

i upper bound lower boundi upper bound i lower bound

upper bound i lower boundcase 4 pitch reset case 8 pitch reset

SVM(predict) +1

-1 5

1, 1,

ii

if T is semantic objectT

otherwise�

� ��

( 5)

SVM

(1)

01-10 ini ijP i j

1 2{ , ,... }i i in iP P P PW� DurijP i j Bi

Ei _ iSyl N i _ijSyl bi j _ijSyl e i j

(2)

bpauseepause 11

(3)

1312-13

13


238

(Keyword spotting) (Speech act)(Semantic slot)

DA pair Erteschik-shir [23]

(Topic) (Focus)

(Pragmatics)

4 54

5

4: DA pairs 5: DA pairs

52 247 56873

173 1061 850211

850 2498 211660

(True Positive, TP)(False Negative, FN)


239

(True Negative, TN)(False Positive, FP) 6

6

(accuracy) (precision)(recall) 6 7 8

TP TNaccuracyTP FP TN FN

��

� � � ( 6)

TPprecisionTP FP

��

( 7)

TPrecallTP FN

��

( 8)

(1) SVMSVM

23-5% 58% 80%

10 858%

2: SVM

accuracy precision recall

4 5 12 13(c=1 g=8) 77.16% 57.83% 68.25%


240

4 5 12 13(c=10 g=16) 74.10% 52.90% 69.19%

4 5 11 12 13(c=1 g=8) 77.42% 58.17% 69.19%

4 5 11 12 13(c=10 g=16) 74.10% 52.94% 68.45%

3 5 6-9 12(c=1 g=8) 75.83% 54.69% 80.0%

3 5 6-9 12(c=10 g=16) 73.04% 51.25% 77.73%

4 5 6-8 12(c=1 g=8) 74.90% 54.01% 70.14%

4 5 6-8 12(c=10 g=16) 71.58% 49.5% 70.62%

(2) SVMSVM

3100%

SVMTP

3: SVM


4 5 12 13(c=1 g=8) 83.38% 70.95% 75.33%

4 5 12 13(c=10 g=16) 81.40% 65.41% 78.03%

4 5 11 12 13(c=1 g=8) 83.51% 70.83% 75.56%

4 5 11 12 13(c=10 g=16) 81.35% 64.91% 77.48%

3 5 6-9 12(c=1 g=8) 82.45% 66.33% 85.15%

3 5 6-9 12(c=10 g=16) 80.61% 63.00% 84.00%

4 5 6-8 12(c=1 g=8) 80.47% 65.02% 75.33%

4 5 6-8 12(c=10 g=16) 76.65% 58.42% 75.22%


241

[14] HTK forced alignmentHMM

(filler) 4

15%

4


Reference 68% 70.22% 68.45%

Label + SVM 77.42% 58.17% 80%

Decision Tree + SVM 83.51% 70.95% 85.15%

HPG SVM

SVM51%~58% 68%~80% 51%~59%

(True Positive, TP) 76%~83%

58%~71% 75%~85%

(NSC 99-2221-E-415-006-MY3) .

[1] Ali, J. Van der Spiegel, P. Mueller, G. Haentjens ,and J. Berman, “An Acoustic-PhoneticFeature-Based System for Automatic Phoneme Recognition in Continuous Speech,” ISCAS 1998.

[2] N. Chater, M. Pickering, and D. Milward. “What is incremental interpretation? ” Edinburgh Working Papers in Cognitive Science, 11:1–22, 1995.


242

[3] J. Li, Y. Tsao and C.H. Lee, “A Study on Knowledge Source Integration for Candidate Rescoring in Automatic Speech Recognition,” ICASSP, IEEE International Conference, vol 1, pp837-840, 2005.

[4] , , 11(2):183-218, 2010.[5] , ,

9.3:659-719, 2008.[6] E. Wieland, F. Gallwitz, and H. Niemann. “Combining stochastic and linguistic language

models for recognition of spontaneous speech.” In Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, vol.1, Atlanta, May, pp 423–426, 1996.

[7] N. N. Bitar and C. Y. Espy-Wilson , “Knowledge-based Parameters for HMM Speech Recognition,” ICASSP 1996.

[8] L. R. Rabiner, “A tutorial on hidden markov models and selected application in speech recognition,” Proceedings of the IEEE, vol.77, no. 2, Feb. 1989.

[9] T. Kawahara, C.H. Lee, and B.H. Juang, “Flexible Speech Understanding Based on Combined Key-Phrase Detection and Verification”, IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, vol.6, NO. 6, pp.558-568, 1998.

[10] R. C. Rose, D. B. Paul, “A Hidden Markov Model Based Keyword Recognition System” Acoustics, Speech, and Signal Processing, ICASSP, vol.1, Page(s): 129 - 132, 1990.

[11] P. Zhang, J. Han, J. Shao, Y. Yan, “A New Keyword Spotting Approach for Spontaneous Mandarin Speech” Signal Processing, 8th International Conference on vol.1, 2006.

[12] H. Bahi, N. Benati, “A New Keyword Spotting Approach” Multimedia Computing and Systems, ICMCS, International Conference , pp.77–80, 2009.

[13] I. Bazzi and J. Glass, “Modeling out-of-vocabulary words for robust speech recognition,” Proc. ICSLP, Beijing, 2000.

[14] H. Jiang, C.H. Lee, “A new approach to utterance verification based on neighborhood information in model space”, IEEE Trans. Speech Audio Process. 11(5), pp. 425-434, 2003.

[15] T.-Y. Kim and H. Ko, “Bayesian Fusion of Confidence Measures for Speech Recognition”, IEEE SIGNAL PROCESSING LETTERS, vol.12, NO. 12, Dec 2005.

[16] Y. BenAyed, D. Fohr, J. P. Haton, G. Chollet, “Improving the Performance of a Keyword Spotting System by Using Support Vector Machines”, in IEEE Auto Speech Recogniton and Understanding Workshop ASRU, St, Thomas, U.S. Virgin islands, Dec 2003.

[17] R. Rose, “Confidence measures for the Switchboard database”, Proc. of International Conference on Acoustics, Speech and Signal Processing, pp.511-514, 1996.

[18] H. Li, B. Ma, and C.H. Lee. “A Vector Space Modeling Approach to Spoken Language Identification”, Audio, Speech, and Language Processing, IEEE Transactions on vol. 15,


243

NO. 1, JANUARY, pp 271-284, 2007.[19] AuToBi. http://eniac.cs.qc.cuny.edu/andrew/autobi/index.html[20] A. Conkie, G. Riccardi, and R. Rose. “Prosody recognition from speech utterances using

acoustic and linguistic based models of prosodic events”. In Eurospeech, 1999.[21] V. R. Sridhar, S. Bangalore, and S. Narayanan. Exploiting acoustic and syntactic features

for prosody labeling in a maximum entropy framework. IEEE Transactions on Audio, Speech & Language Processing, 16(4):797–811, 2008.

[22] N. Erteschik-shir, Information Structure: The Syntax-Discourse Interface, 2007.[23] , , ,

89[24] , , ,

89[25] , , , 93

[26] , , , 95[27] , , ,

96[28] C.Y. Tseng, “Discourse Speech Tempo”. JAIST Symposium on Modeling of Speech and

Audiovisual Mechanism. Ishikawa, Japan. 2011.[29] C.Y. Tseng, and C.H. Chang, 2007. Pause or No Pause? Phrase Boundaries

Revisited . The 9th National Conference on Man-Machine Speech CommunicationNCMMSC). , , 2007.

[30] .

280-312. , , 2008


244

01 ( )NumiP PW i in

02 ( )DuriP PW i

1

inDurij

jP

��

03 _ ( )Dur MaxiP PW i 1 2{ , ,..., }Dur Dur Dur

i i inMax P P P

04 _ ( )Dur MiniP PW i 1 2{ , ,..., }Dur Dur Dur

i i inMin P P P

05 ( )iDur PW i ( )i i iB E Pause PW� �

06 ( )iSyl PW i _ iSyl N

07 1( )iDur Syl i 1 1 1_ _i iSyl e Syl b�




11 ( )iPause PW i pause pausee b�

12 ( )ipos PW i iBE

13 ( )N Speech N


245

Documents

Study on Keyword Spotting using Prosodic Attribute ...IÖ µ#´F, =Ôaug 5a.·p»6±!ÑDÐP + , /Ï ã'¢gã ãP¥aw ãPOS Jñp»!®78 ² 6±HMM ã