Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
Study on Keyword Spotting using Prosodic Attribute Detection for
Conversational SpeechYu-Jui Huang
Department of Computer Science and Information EngineeringNational Chia-Yi [email protected]
Yin-Wei Chung
Department of Computer Science and Information EngineeringNational Chia-Yi [email protected]
Jui-Feng Yeh
Department of Computer Science and Information EngineeringNational Chia-Yi University
(SVM)
SVM
Abstract
It is one of most essential issues to extract the keywords from conversational speech for understanding the utterances from speakers. This thesis aims at keyword spotting from spontaneous speech for keyword detecting. We proposed prosodic features that are used for keyword detection. The prosody words are segmented from speaker’s utterance according to the pre-training decision tree. The supported vector machine is further used as the classifier to judge the prosody word is keyword or not. The prosody word boundary segmentation algorithm based on decision tree is illustrated. Besides the data driven feature, the knowledge obtained from the corpus observation is integrated in the decision tree. Finally, the keyword
Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
231
in the focus part are extracted using prosody features by sported vector machine (SVM). According to the experimental results, we can find the proposed method outperform the phone verification approach especially in recall and accuracy. This shows the proposed approach is operative for keyword detecting.
Keywords: Keyword spotting, prosodic feature, prosody word, spoken language.
(Keyword spotting)
( )(Spontaneous speech)
(Dialogue system) (Speaking style)(Grammar)
(Real time)Kawahara (Keyword extraction)
(Verification)(Key-phrase detection) (Key-phrase verification) (Sentence parsing)
(sentence verification)(Incremental understanding) [1]
Charpter [2]
(Spoken Language Understanding, SLU)
(Knowledge based) [3](Prosodic attribute)
(Hierarchical Prosodic Phrase Grouping, HPG)[4][5] (Prosodic word)
Ali[1] Wieland
Bi-gram Beam-search Viterbi[6] Bitar
HMM[7] Rabiner 1989
Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
232
[8] Tatsuya Kawahara Chin-Hui Lee Key-Phrase Detection Verification
[9]
Rose[10] HMM
(filler) Zhang[11]
Bahi[12]
HMMBazzi
HMM [13]
Lee C.H.[14]Kim[15]
[16][17]
Haizhou Li, Bin Ma, and Chin-Hui Lee [18]
AuToBi [19]POS HMM
Conkie [20]delta HMM
Sridhar[21] HMMHMM
Erteschik-shir [22]
[23]
[24][25]
[26]MFCC
Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
233
[27] SVM
HPGFujisaki Model
[4][5][28][29][30] HPG
1
1:
(Prosodic Attributes Extraction)(Pitch) (Intensity) (Duration)
(HPG)(Prosodic Word Boundary) (Boundary Decision Tree)
Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
234
SVM(Keyword Detector)
(Prosodic Word Detection)(Keyword Detection)
(syllable) (prosodic word)(intonation phrase)
(Hierarchical Prosodic Phrase Grouping, HPG)[4][5]
(syllable, Syl)(prosodic word, PW) (prosodic phrase, PPh) (breath-group)
(prosodic phrase group, PG)B1 B2 B3 B4 B5
B5 B1
B2
2 B
2 (Prosody word)
Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
235
[4][5]
3 9
>
Case 1
Case 2
Case 3
Case 4 Case 5 Case 6 Case 7 Case 8 Case 9
(Pause)
(Pitch Reset) (Pitch Reset)
3 HPG
9(Pitch reset)
1
1 HPG
Case 1 >
Case 2
Case 3
Case 4 (Pitch reset)
Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
236
Case 5
Case 6
Case 7
Case 8 (Pitch reset)
Case 9
(1) =0.040.03 0.05
0.04
(2)( 1)
(slope) i
( )i i iP t t� �� � ( 1)
Pi(t) i t i i biei i 2
2
( )( ( ) ), [ , ]
( )
i
i
i
i
e
i it b
i i ie
t b
t t P t Pt b e
t t� �
�
� �� �
�
�
�( 2)
t 3 iP i4 n
1 ( )2 i it e b� � ( 3)
1 ( )i
i
e
i it b
P P tn �
� � ( 4)
Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
237
i upper bound lower boundi upper bound i lower bound
upper bound i lower boundcase 4 pitch reset case 8 pitch reset
SVM(predict) +1
-1 5
1, 1,
ii
if T is semantic objectT
otherwise�
� ��
( 5)
SVM
(1)
01-10 ini ijP i j
1 2{ , ,... }i i in iP P P PW� DurijP i j Bi
Ei _ iSyl N i _ijSyl bi j _ijSyl e i j
(2)
bpauseepause 11
(3)
1312-13
13
Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
238
(Keyword spotting) (Speech act)(Semantic slot)
DA pair Erteschik-shir [23]
(Topic) (Focus)
(Pragmatics)
4 54
5
4: DA pairs 5: DA pairs
52 247 56873
173 1061 850211
850 2498 211660
(True Positive, TP)(False Negative, FN)
Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
239
(True Negative, TN)(False Positive, FP) 6
6
(accuracy) (precision)(recall) 6 7 8
TP TNaccuracyTP FP TN FN
��
� � � ( 6)
TPprecisionTP FP
��
( 7)
TPrecallTP FN
��
( 8)
(1) SVMSVM
23-5% 58% 80%
10 858%
2: SVM
accuracy precision recall
4 5 12 13(c=1 g=8) 77.16% 57.83% 68.25%
Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
240
4 5 12 13(c=10 g=16) 74.10% 52.90% 69.19%
4 5 11 12 13(c=1 g=8) 77.42% 58.17% 69.19%
4 5 11 12 13(c=10 g=16) 74.10% 52.94% 68.45%
3 5 6-9 12(c=1 g=8) 75.83% 54.69% 80.0%
3 5 6-9 12(c=10 g=16) 73.04% 51.25% 77.73%
4 5 6-8 12(c=1 g=8) 74.90% 54.01% 70.14%
4 5 6-8 12(c=10 g=16) 71.58% 49.5% 70.62%
(2) SVMSVM
3100%
SVMTP
3: SVM
accuracy precision recall
4 5 12 13(c=1 g=8) 83.38% 70.95% 75.33%
4 5 12 13(c=10 g=16) 81.40% 65.41% 78.03%
4 5 11 12 13(c=1 g=8) 83.51% 70.83% 75.56%
4 5 11 12 13(c=10 g=16) 81.35% 64.91% 77.48%
3 5 6-9 12(c=1 g=8) 82.45% 66.33% 85.15%
3 5 6-9 12(c=10 g=16) 80.61% 63.00% 84.00%
4 5 6-8 12(c=1 g=8) 80.47% 65.02% 75.33%
4 5 6-8 12(c=10 g=16) 76.65% 58.42% 75.22%
Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
241
[14] HTK forced alignmentHMM
(filler) 4
15%
4
accuracy precision recall
Reference 68% 70.22% 68.45%
Label + SVM 77.42% 58.17% 80%
Decision Tree + SVM 83.51% 70.95% 85.15%
HPG SVM
SVM51%~58% 68%~80% 51%~59%
(True Positive, TP) 76%~83%
58%~71% 75%~85%
(NSC 99-2221-E-415-006-MY3) .
[1] Ali, J. Van der Spiegel, P. Mueller, G. Haentjens ,and J. Berman, “An Acoustic-PhoneticFeature-Based System for Automatic Phoneme Recognition in Continuous Speech,” ISCAS 1998.
[2] N. Chater, M. Pickering, and D. Milward. “What is incremental interpretation? ” Edinburgh Working Papers in Cognitive Science, 11:1–22, 1995.
Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
242
[3] J. Li, Y. Tsao and C.H. Lee, “A Study on Knowledge Source Integration for Candidate Rescoring in Automatic Speech Recognition,” ICASSP, IEEE International Conference, vol 1, pp837-840, 2005.
[4] , , 11(2):183-218, 2010.[5] , ,
9.3:659-719, 2008.[6] E. Wieland, F. Gallwitz, and H. Niemann. “Combining stochastic and linguistic language
models for recognition of spontaneous speech.” In Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, vol.1, Atlanta, May, pp 423–426, 1996.
[7] N. N. Bitar and C. Y. Espy-Wilson , “Knowledge-based Parameters for HMM Speech Recognition,” ICASSP 1996.
[8] L. R. Rabiner, “A tutorial on hidden markov models and selected application in speech recognition,” Proceedings of the IEEE, vol.77, no. 2, Feb. 1989.
[9] T. Kawahara, C.H. Lee, and B.H. Juang, “Flexible Speech Understanding Based on Combined Key-Phrase Detection and Verification”, IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, vol.6, NO. 6, pp.558-568, 1998.
[10] R. C. Rose, D. B. Paul, “A Hidden Markov Model Based Keyword Recognition System” Acoustics, Speech, and Signal Processing, ICASSP, vol.1, Page(s): 129 - 132, 1990.
[11] P. Zhang, J. Han, J. Shao, Y. Yan, “A New Keyword Spotting Approach for Spontaneous Mandarin Speech” Signal Processing, 8th International Conference on vol.1, 2006.
[12] H. Bahi, N. Benati, “A New Keyword Spotting Approach” Multimedia Computing and Systems, ICMCS, International Conference , pp.77–80, 2009.
[13] I. Bazzi and J. Glass, “Modeling out-of-vocabulary words for robust speech recognition,” Proc. ICSLP, Beijing, 2000.
[14] H. Jiang, C.H. Lee, “A new approach to utterance verification based on neighborhood information in model space”, IEEE Trans. Speech Audio Process. 11(5), pp. 425-434, 2003.
[15] T.-Y. Kim and H. Ko, “Bayesian Fusion of Confidence Measures for Speech Recognition”, IEEE SIGNAL PROCESSING LETTERS, vol.12, NO. 12, Dec 2005.
[16] Y. BenAyed, D. Fohr, J. P. Haton, G. Chollet, “Improving the Performance of a Keyword Spotting System by Using Support Vector Machines”, in IEEE Auto Speech Recogniton and Understanding Workshop ASRU, St, Thomas, U.S. Virgin islands, Dec 2003.
[17] R. Rose, “Confidence measures for the Switchboard database”, Proc. of International Conference on Acoustics, Speech and Signal Processing, pp.511-514, 1996.
[18] H. Li, B. Ma, and C.H. Lee. “A Vector Space Modeling Approach to Spoken Language Identification”, Audio, Speech, and Language Processing, IEEE Transactions on vol. 15,
Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
243
NO. 1, JANUARY, pp 271-284, 2007.[19] AuToBi. http://eniac.cs.qc.cuny.edu/andrew/autobi/index.html[20] A. Conkie, G. Riccardi, and R. Rose. “Prosody recognition from speech utterances using
acoustic and linguistic based models of prosodic events”. In Eurospeech, 1999.[21] V. R. Sridhar, S. Bangalore, and S. Narayanan. Exploiting acoustic and syntactic features
for prosody labeling in a maximum entropy framework. IEEE Transactions on Audio, Speech & Language Processing, 16(4):797–811, 2008.
[22] N. Erteschik-shir, Information Structure: The Syntax-Discourse Interface, 2007.[23] , , ,
89[24] , , ,
89[25] , , , 93
[26] , , , 95[27] , , ,
96[28] C.Y. Tseng, “Discourse Speech Tempo”. JAIST Symposium on Modeling of Speech and
Audiovisual Mechanism. Ishikawa, Japan. 2011.[29] C.Y. Tseng, and C.H. Chang, 2007. Pause or No Pause? Phrase Boundaries
Revisited . The 9th National Conference on Man-Machine Speech CommunicationNCMMSC). , , 2007.
[30] .
280-312. , , 2008
Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
244
01 ( )NumiP PW i in
02 ( )DuriP PW i
1
inDurij
jP
��
03 _ ( )Dur MaxiP PW i 1 2{ , ,..., }Dur Dur Dur
i i inMax P P P
04 _ ( )Dur MiniP PW i 1 2{ , ,..., }Dur Dur Dur
i i inMin P P P
05 ( )iDur PW i ( )i i iB E Pause PW� �
06 ( )iSyl PW i _ iSyl N
07 1( )iDur Syl i 1 1 1_ _i iSyl e Syl b�
08 2( )iDur Syl i 2 2 2_ _i iSyl e Syl b�
09 3( )iDur Syl i 3 3 3_ _i iSyl e Syl b�
10 4( )iDur Syl i 4 4 4_ _i iSyl e Syl b�
11 ( )iPause PW i pause pausee b�
12 ( )ipos PW i iBE
13 ( )N Speech N
Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing (ROCLING 2012)
245