

Attention-based End-to-End Models for Small-Footprint Keyword Spotting

Changhao Shan 1,2, Junbo Zhang 2, Yujun Wang 2, Lei Xie 1*

1 Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University, Xi’an, China

2 Xiaomi Inc., Beijing, China
{chshan, lxie}@nwpu-aslp.org, {zhangjunbo, wangyujun}@xiaomi.com

Abstract

In this paper, we propose an attention-based end-to-end neural approach for small-footprint keyword spotting (KWS), which aims to simplify the pipelines of building a production-quality KWS system. Our model consists of an encoder and an attention mechanism. Using RNNs, the encoder transforms the input signal into a high-level representation. Then the attention mechanism weights the encoder features and generates a fixed-length vector. Finally, by linear transformation and softmax function, the vector becomes a score used for keyword detection. We also evaluate the performance of different encoder architectures, including LSTM, GRU and CRNN. Experiments on real-world wake-up data show that our approach outperforms the recent Deep KWS approach [9] by a large margin and the best performance is achieved by CRNN. To be more specific, with ∼84K parameters, our attention-based model achieves a 1.02% false rejection rate (FRR) at 1.0 false alarm (FA) per hour.

Index Terms: attention-based model, end-to-end keyword spotting, convolutional neural networks, recurrent neural networks

1. Introduction

Keyword spotting (KWS), or spoken term detection (STD), is the task of detecting pre-defined keywords in a stream of audio. Specifically, as a typical application of KWS, wake-up word detection has become an indispensable function on various devices, in order to enable users to have a fully hands-free experience. A practical on-device KWS module must minimize the false rejection rate at a low false alarm rate to make it easy to use, while keeping the memory footprint, latency and computational cost as small as possible.

As a classic solution, large vocabulary continuous speech recognition (LVCSR) based systems [1, 2] are widely used in the KWS task. Although keywords can be flexibly changed according to the user's requirements, LVCSR based systems need to generate rich lattices, and keyword search over them requires high computational resources. These systems are often designed to search large databases of audio content. Several recent attempts have been proposed to reduce the computational cost, e.g., using end-to-end acoustic models [3, 4]. But these models are still quite large, making them unsuitable for small-footprint, low-latency applications. Another classic technique for KWS is the keyword/filler hidden Markov model (HMM) approach [5], which remains strongly competitive today. HMMs are trained for keyword and non-keyword audio segments, respectively. At runtime, Viterbi decoding is used to search for the best path in the decoding graph, which can be computationally expensive depending on the HMM topology.

*Corresponding author.

In these approaches, Gaussian mixture models (GMMs) were originally used to model the observed acoustic features, but with the advances in deep learning, deep neural networks (DNNs) have recently been adopted to substitute GMMs [6], with improved performance. Some studies replaced the HMM with an RNN model trained with the connectionist temporal classification (CTC) criterion [7] or with an attention-based model [8]. However, these studies are still under the keyword/filler framework.

As a small-footprint approach used by Google, Deep KWS [9] has drawn much attention recently. In this approach, a simple DNN is trained to predict the frame-level posteriors of sub-keyword targets and fillers. When a confidence score, produced by a posterior handling method, exceeds a threshold, a keyword is detected. This approach has been shown to outperform a keyword/filler HMM approach. In addition, it is highly attractive for on-device use with small footprint and low latency, as the size of the DNN can be easily controlled and no graph search is involved. Later, feed-forward DNNs were substituted by more powerful networks like convolutional neural networks (CNNs) [10] and recurrent neural networks (RNNs) [11], with expected improvements. It should be noted that, although the framework of Deep KWS is quite simple, it still needs a well-trained acoustic model to obtain frame-level alignments.
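To make this baseline concrete, a minimal NumPy sketch of the posterior handling, following the description in [9], could look as follows (the window sizes are the defaults reported in [9], not this paper's settings; Section 3.2 uses a 20-frame smoothing window):

```python
import numpy as np

def kws_confidence(posteriors, w_smooth=30, w_max=100):
    """Sketch of Deep KWS posterior handling as described in [9].

    posteriors: (T, n) array of frame-level network outputs, where
    column 0 is the filler and columns 1..n-1 are sub-keyword units.
    Window sizes are the values reported in [9]; they are assumptions
    here, not this paper's exact configuration.
    """
    T, n = posteriors.shape
    # smooth each label's posterior over a trailing window of w_smooth frames
    smoothed = np.stack([
        posteriors[max(0, t - w_smooth + 1):t + 1].mean(axis=0)
        for t in range(T)
    ])
    conf = np.zeros(T)
    for t in range(T):
        win = smoothed[max(0, t - w_max + 1):t + 1, 1:]
        # geometric mean of the best score of each sub-keyword unit
        conf[t] = np.prod(win.max(axis=0)) ** (1.0 / (n - 1))
    return conf  # the keyword fires when conf exceeds a threshold
```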

In this paper, we aim to further simplify the pipeline of building a production-quality KWS system. Specifically, we propose an attention-based end-to-end neural model for small-footprint keyword spotting. By end-to-end, we mean that: (1) the model directly outputs the keyword detection result; (2) no complicated search is involved; (3) no alignments are needed beforehand to train the model. Our work is inspired by the recent success of attention models in speech recognition [12, 13, 14], machine translation [15], text summarization [16] and speaker verification [17]. It is intuitive to use an attention mechanism in KWS: humans are able to focus on a certain region of an audio stream with “high resolution” (e.g., the listener’s name) while perceiving the surrounding audio in “low resolution”, and then adjust the focal point over time.

Our end-to-end KWS model consists of an encoder and an attention mechanism. Using RNNs, the encoder transforms the input signal into a high-level representation. Then the attention mechanism weights the encoder features and generates a fixed-length vector. Finally, by linear transformation and softmax function, the vector becomes a score used for keyword detection. In terms of being end-to-end and small-footprint, the closest approach to ours is the one proposed by Arik et al. [18], where a convolutional recurrent neural network (CRNN) architecture is used. However, the latency introduced by its long decoding window (T = 1.5 secs) makes the system difficult to use in real applications.

Figure 1: Attention-based end-to-end model for KWS.

To improve our approach, we further explore the encoder architectures, including LSTM [19], GRU [20] and CRNN [18]. Experiments on real-world wake-up data show that our approach outperforms Deep KWS by a large margin. GRU is preferred over LSTM and the best performance is achieved by CRNN. To be more specific, with only ∼84K parameters, the CRNN-based attention model achieves a 1.02% false rejection rate (FRR) at 1.0 false alarm (FA) per hour.

2. Attention-based KWS

2.1. End-to-end architecture

We propose to use an attention-based end-to-end model for small-footprint keyword spotting. As depicted in Fig. 1, the end-to-end architecture consists of two major sub-modules: the encoder and the attention mechanism. The encoder produces a higher-level feature representation h = (h_1, ..., h_T) from the input speech features x = (x_1, ..., x_T):

h = Encoder(x). (1)

Specifically, the Encoder is usually an RNN that can directly make use of speech contextual information. In our work, we explore different encoder structures, including GRU, LSTM and CRNN. The attention mechanism learns normalized weights α_t ∈ [0, 1] from the feature representation:

α_t = Attend(h_t). (2)

Then we form a fixed-length vector c as the weighted average of the Encoder outputs h:

c = ∑_{t=1}^{T} α_t h_t. (3)

Finally, we generate a probability distribution by a linear transformation and the softmax function:

p(y) = softmax(Uc). (4)

where U is the linear transformation matrix and y indicates whether the keyword is detected.
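To make the pipeline concrete, here is a minimal PyTorch sketch of Eqs. (1)-(4). This is an illustration, not the authors' code: feat_dim=40 matches the features in Section 3.1, the 2-layer, 64-node GRU matches the 2-64 configuration, and `attend` stands for any module returning normalized weights, e.g., the soft attention sketched in Section 2.2.

```python
import torch
import torch.nn as nn

class AttentionKWS(nn.Module):
    """Minimal sketch of Eqs. (1)-(4): encoder, attention, softmax."""
    def __init__(self, attend, feat_dim=40, hidden=64, layers=2):
        super().__init__()
        # Eq. (1): RNN encoder mapping input features to h = (h_1, ..., h_T)
        self.encoder = nn.GRU(feat_dim, hidden, num_layers=layers,
                              batch_first=True)
        self.attend = attend            # Eq. (2): attention module
        # Eq. (4): linear transform U over the {filler, keyword} classes
        self.U = nn.Linear(hidden, 2)

    def forward(self, x):               # x: (batch, T, feat_dim)
        h, _ = self.encoder(x)          # h: (batch, T, hidden)
        alpha = self.attend(h)          # alpha: (batch, T), sums to 1 over T
        c = (alpha.unsqueeze(-1) * h).sum(dim=1)     # Eq. (3)
        return torch.softmax(self.U(c), dim=-1)      # p(y), Eq. (4)
```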

2.2. Attention mechanism

Similar to human listening attention, the attention mechanism in our model selects the speech parts which are more likely to contain the keyword while ignoring the unrelated parts. We investigate both average attention and soft attention.

Figure 2: Sliding windows used in decoding.

Average attention: The Attend model does not have trainable parameters, and α_t is simply set to the average over the T frames:

α_t = 1/T. (5)

Soft attention: This attention method is borrowed from speaker verification [17]. Compared with other attention layers, the shared-parameter non-linear attention has been proven effective [17]. We first learn a scalar score e_t:

e_t = v^T tanh(W h_t + b). (6)

Then we compute the normalized weights α_t from these scalar scores:

α_t = exp(e_t) / ∑_{j=1}^{T} exp(e_j). (7)
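A matching PyTorch sketch of Eqs. (6)-(7) might look as follows (layer dimensions are assumptions, not the authors' released configuration):

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Shared-parameter non-linear soft attention, Eqs. (6)-(7)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.W = nn.Linear(hidden, hidden)           # W and b in Eq. (6)
        self.v = nn.Linear(hidden, 1, bias=False)    # vector v in Eq. (6)

    def forward(self, h):                  # h: (batch, T, hidden)
        e = self.v(torch.tanh(self.W(h)))  # e_t, Eq. (6): (batch, T, 1)
        return torch.softmax(e.squeeze(-1), dim=1)   # alpha_t, Eq. (7)
```

The two sketches combine as `model = AttentionKWS(attend=SoftAttention(64))`; average attention (Eq. 5) would correspond to replacing the forward pass with a constant 1/T for every frame.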

2.3. Decoding

As shown in Fig. 1, unlike some other approaches [9], our end-to-end system outputs a confidence score directly without post-processing. Similar to the Deep KWS system, our system is triggered when p(y = 1) exceeds a preset threshold. During decoding, as shown in Fig. 2, the input is a sliding window of speech features, which has a preset length and contains the entire keyword. Meanwhile, a frame shift is employed. For each sliding window, we only need to feed one new frame into the network for computation; the remaining frames have already been computed for the previous sliding window. From Table 1, we can clearly see that the small number of parameters of the proposed approach leads to a small memory footprint, while the limited number of multiply operations results in low latency.

Table 1: The number of parameters and multiplies.

Model                 Params (K)   Multiplies (K)
DNN KWS [9]           244          244
CNN KWS [10]          47.6         428.5
Attention-based KWS   77.5         83.3
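The incremental decoding described above can be sketched as follows, reusing the model sketch from Section 2.1 (the 100-frame window and single-frame shift follow Section 3.3; the threshold is an arbitrary placeholder, not a tuned value):

```python
import collections
import torch

@torch.no_grad()
def stream_detect(model, frames, win=100, threshold=0.8):
    """Sliding-window decoding sketch: each step feeds only the new
    frame through the recurrent encoder and reuses buffered encoder
    outputs for the rest of the window."""
    h_buf = collections.deque(maxlen=win)  # encoder outputs in the window
    state = None                           # recurrent state carried over
    for frame in frames:                   # frame: tensor of shape (feat_dim,)
        x = frame.view(1, 1, -1)           # (batch=1, T=1, feat_dim)
        h_t, state = model.encoder(x, state)
        h_buf.append(h_t.squeeze(1))       # store (1, hidden)
        if len(h_buf) == win:
            h = torch.stack(list(h_buf), dim=1)        # (1, win, hidden)
            alpha = model.attend(h)
            c = (alpha.unsqueeze(-1) * h).sum(dim=1)   # Eq. (3)
            p = torch.softmax(model.U(c), dim=-1)
            if p[0, 1].item() > threshold:             # p(y=1) > threshold
                return True                            # keyword detected
    return False
```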

3. Experiments

3.1. Datasets

We evaluated the proposed approach using real-world wake-up data collected from the Mi AI Speaker¹. The wake-up word is a four-syllable Mandarin Chinese term (“xiao-ai-tong-xue”). We collected ∼188.9K positive examples (∼99.8h) and ∼1007.4K negative examples (∼1581.8h) as the training set. The held-out validation set has ∼9.9K positive examples and ∼53.0K negative examples. The test set has ∼28.8K positive examples (∼15.2h) and ∼32.8K negative examples (∼37h). Each audio frame was computed based on a 40-channel Mel-filterbank with 25ms windowing and 10ms frame shift. The filterbank features were then converted to per-channel energy normalized (PCEN) [21] Mel-spectrograms.

¹ https://www.mi.com/aispeaker/

Table 2: Performance comparison between Deep KWS and attention-based models with the 2-64 network. FRR is at 1.0 false alarm (FA) per hour.

Model                    FRR (%)   Params (K)
DNN KWS                  13.9      62.5
LSTM KWS                 7.10      54.1
LSTM average attention   4.43      60.0
LSTM soft attention      3.58      64.3
GRU KWS                  6.38      44.8
GRU average attention    3.22      49.2
GRU soft attention       1.93      53.4

Figure 3: ROCs for Deep KWS vs. attention-based systems with the 2-64 network.
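For reference, PCEN [21] replaces static log compression with an adaptive per-channel gain control. A minimal NumPy sketch follows, using the default constants reported in [21] (the paper does not state its exact PCEN settings, so these values are assumptions):

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization [21] on a (T, channels)
    Mel-filterbank energy matrix E. Constants are the defaults
    reported in [21], not necessarily this paper's settings."""
    M = np.zeros_like(E)
    M[0] = E[0]
    for t in range(1, len(E)):           # first-order IIR smoother
        M[t] = (1 - s) * M[t - 1] + s * E[t]
    # adaptive gain normalization followed by root compression
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```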

3.2. Baseline

We reimplemented the Deep KWS system [9] as the baseline, in which the network predicts the posteriors of the four Chinese syllables in the wake-up word and a filler. The “filler” here means any voice that does not contain the keyword. Specifically, we adopted three different networks, including DNN, LSTM and GRU. For a fair comparison, the network configurations were set to have a similar number of parameters to the proposed attention models. The feed-forward DNN model had 3 hidden layers with 64 hidden nodes per layer and rectified linear unit (ReLU) non-linearity. An input window with 15 left frames and 5 right frames was used. The LSTM and GRU models were built with 2 hidden layers and 64 hidden nodes per layer. For the GRU KWS model, the final GRU layer was followed by a fully connected layer with ReLU non-linearity. There were no stacked frames in the input for the LSTM and GRU models. The smoothing window for Deep KWS was set to 20 frames. We also trained a TDNN-based acoustic model [22] using ∼3000 hours of speech data to perform frame-level alignment before KWS model training.

Table 3: Performance of different encoder architectures with soft attention. FRR is at 1.0 false alarm (FA) per hour.

Recurrent Unit   Layer   Node   FRR (%)   Params (K)
LSTM             1       64     4.36      31.2
LSTM             2       64     3.58      64.3
LSTM             3       64     3.05      97.3
LSTM             1       128    2.99      103
GRU              1       64     3.22      28.7
GRU              2       64     1.93      53.4
GRU              3       64     1.99      78.2
GRU              1       128    1.49      77.5

Table 4: Performance of adding convolutional layers to the GRU (CRNN) attention-based model with soft attention. FRR is at 1.0 false alarm (FA) per hour.

Channel   Layer   Node   FRR (%)   Params (K)
8         1       64     2.48      52.5
8         2       64     1.34      77.3
16        1       64     1.02      84.1
16        2       64     1.29      109

3.3. Experimental setup

In the neural network models, all the weight matrices were initialized with the normalized initialization [23] and the bias vectors were initialized to 0. We used ADAM [24] as the optimization method, and we decayed the learning rate from 1e-3 to 1e-4 after convergence. Gradient norm clipping to 1 was applied, together with an L2 weight decay of 1e-5. Each positive training sample has a length of T = 1.9 seconds, which ensures that the entire wake-up word is included. Accordingly, in the attention models, the input window was set to 189 frames to cover the length of the wake-up word. We randomly selected 189 contiguous frames from the negative example set to train the attention models. At runtime, the sliding window was set to 100 frames and the frame shift was set to 1. Performance was measured by observing the FRR at the operating threshold of 1.0 FA per hour, and by plotting receiver operating characteristic (ROC) curves.
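The operating point can be computed with a sketch like the following (a simplified per-example approximation for illustration; an exact scorer would count false alarms over the streaming negative audio):

```python
import numpy as np

def frr_at_fa(pos_scores, neg_scores, neg_hours, fa_per_hour=1.0):
    """Sketch: pick the threshold that yields the target false-alarm
    rate on the negative data, then report the false rejection rate
    (FRR) of the positive examples at that threshold."""
    n_fa = int(fa_per_hour * neg_hours)         # allowed false alarms
    thr = np.sort(neg_scores)[-(n_fa + 1)]      # n_fa negatives score above thr
    frr = float(np.mean(pos_scores <= thr))     # fraction of rejected positives
    return 100.0 * frr, thr
```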

3.4. Impact of attention mechanism

From Table 2 and Fig. 3, we can clearly see the superior performance of the attention models. With a similar number of parameters, the proposed attention models outperform the Deep KWS systems by a large margin. We also note that GRU is preferred over LSTM in both Deep KWS and the attention models. Not surprisingly, the soft attention-based model achieves the best performance. At 1.0 FA/hour, the GRU attention model reduces the FRR from 6.38% (GRU Deep KWS) down to 1.93%, a remarkable false rejection reduction.

3.5. Impact of encoder architecture

We further explored the impact of encoder architectures with soft attention. Results are summarized in Table 3, Fig. 4 and Fig. 5. From Table 3, we notice that the bigger models always perform better than the smaller models. Among the LSTM models, the 1-128 LSTM model achieves the best performance with an FRR of 2.99% at 1.0 FA/hour. In Fig. 4, the ROC curves of the 1-128 LSTM model and the 3-64 LSTM model overlap at lower FA per hour. This means that making the LSTM network wider or deeper can achieve the same effect. However, observing Fig. 5, the same conclusion does not hold for GRU. The 1-128 GRU model presents a significant advantage over the 3-64 GRU model. In other words, increasing the number of nodes may be more effective than increasing the number of layers. Finally, the 1-128 GRU model achieves 1.49% FRR at 1.0 FA/hour.

Figure 4: ROCs for LSTM attention-based models with soft attention.

Figure 5: ROCs for GRU attention-based models with soft attention.

3.6. Adding convolutional layer

Inspired by [18], we finally studied the impact of adding convolutional layers to the GRU attention model, as convolutional networks are often used as a way to extract invariant features. For the CRNN attention-based model, we used a one-layer CNN with a C(20×5) filter and a 1×2 stride. We explored different numbers of output channels, and the results are summarized in Table 4 and Fig. 6. From Table 4, we can see that adding a convolutional layer can further improve the performance. We achieve the lowest FRR of 1.02% at 1.0 FA/hour with 84.1K parameters. Another observation is that the 16-channel models work better than the 8-channel models. By increasing layers, the 8-2-64 model achieves a great gain over the 8-1-64 model. But we observe no extra benefit when increasing the layers of the 16-channel models.
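Reading C(20×5) with stride 1×2 as a 20 (time) × 5 (frequency) kernel with stride 1 in time and 2 in frequency, a sketch of this convolutional front-end could look as follows (the kernel orientation is an assumption; channel and GRU sizes follow the best 16-1-64 row of Table 4):

```python
import torch
import torch.nn as nn

class CRNNEncoder(nn.Module):
    """Sketch of the CRNN encoder: one conv layer feeding the GRU."""
    def __init__(self, feat_dim=40, channels=16, hidden=64):
        super().__init__()
        # 20x5 kernel, 1x2 stride over (time, frequency), as read from Sec. 3.6
        self.conv = nn.Conv2d(1, channels, kernel_size=(20, 5), stride=(1, 2))
        freq_out = (feat_dim - 5) // 2 + 1           # frequency bins after conv
        self.gru = nn.GRU(channels * freq_out, hidden, batch_first=True)

    def forward(self, x):                # x: (batch, T, feat_dim)
        z = self.conv(x.unsqueeze(1))    # (batch, C, T', F')
        z = z.permute(0, 2, 1, 3).flatten(2)         # (batch, T', C*F')
        h, _ = self.gru(z)
        return h                         # encoder outputs for the attention
```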

In summary, Fig. 7 plots the ROC curves of the best three systems. We can see that GRU and CRNN outperform LSTM by a large margin, and the best performance is achieved by CRNN.

Figure 6: ROCs for CRNN attention-based models with soft attention.

Figure 7: ROCs for different architectures with the soft attention-based model.

4. Conclusions

In this paper, we propose an attention-based end-to-end model for small-footprint keyword spotting. Compared with the Deep KWS system, the attention-based system achieves superior performance. Our system consists of two main sub-modules: the encoder and the attention mechanism. We explore several encoder architectures, including LSTM, GRU and CRNN. Experiments show that GRU is preferred over LSTM and the best performance is achieved by CRNN. We also explore two attention mechanisms: average attention and soft attention. Our results show that soft attention performs better than average attention. With ∼84K parameters, our end-to-end system finally achieves 1.02% FRR at 1.0 FA/hour.

5. Acknowledgements

The authors would like to thank Jingyong Hou for helpful comments and suggestions. The research work is supported by the National Natural Science Foundation of China (Grant No. 61571363).


6. References

[1] P. Motlicek, F. Valente, and I. Szoke, “Improving acoustic based keyword spotting using LVCSR lattices,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4413–4416.
[2] D. Can and M. Saraclar, “Lattice indexing for spoken term detection,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, pp. 2338–2347, 2011.
[3] Y. Bai, J. Yi, H. Ni, Z. Wen, B. Liu, Y. Li, and J. Tao, “End-to-end keywords spotting based on connectionist temporal classification for Mandarin,” in Chinese Spoken Language Processing (ISCSLP), 2016 10th International Symposium on. IEEE, 2016, pp. 1–5.
[4] A. Rosenberg, K. Audhkhasi, A. Sethy, B. Ramabhadran, and M. Picheny, “End-to-end speech recognition and keyword search on low-resource languages,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 5280–5284.
[5] J. R. Rohlicek, W. Russell, S. Roukos, and H. Gish, “Continuous hidden Markov modeling for speaker-independent word spotting,” in Acoustics, Speech, and Signal Processing, 1989. ICASSP-89., 1989 International Conference on. IEEE, 1989, pp. 627–630.
[6] I. Szöke, P. Schwarz, P. Matějka, L. Burget, M. Karafiát, M. Fapšo, and J. Černocký, “Comparison of keyword spotting approaches for informal continuous speech,” in INTERSPEECH 2005 – Eurospeech, European Conference on Speech Communication and Technology, Lisbon, Portugal, September 2005, pp. 633–636.
[7] K. Hwang, M. Lee, and W. Sung, “Online keyword spotting with a character-level recurrent neural network,” arXiv preprint arXiv:1512.08903, 2015.
[8] Y. He, R. Prabhavalkar, K. Rao, W. Li, A. Bakhtin, and I. McGraw, “Streaming small-footprint keyword spotting using sequence-to-sequence models,” arXiv preprint arXiv:1710.09617, 2017.
[9] G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 4087–4091.
[10] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[11] M. Sun, A. Raju, G. Tucker, S. Panchapagesan, G. Fu, A. Mandal, S. Matsoukas, N. Strom, and S. Vitaladevuni, “Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 474–480.
[12] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4945–4949.
[13] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4960–4964.
[14] C. Shan, J. Zhang, Y. Wang, and L. Xie, “Attention-based end-to-end speech recognition on voice search,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE.
[15] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[16] A. M. Rush, S. Chopra, and J. Weston, “A neural attention model for abstractive sentence summarization,” arXiv preprint arXiv:1509.00685, 2015.
[17] F. Chowdhury, Q. Wang, I. L. Moreno, and L. Wan, “Attention-based models for text-dependent speaker verification,” arXiv preprint arXiv:1710.10470, 2017.
[18] S. O. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, and A. Coates, “Convolutional recurrent neural networks for small-footprint keyword spotting,” arXiv preprint arXiv:1703.05390, 2017.
[19] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[20] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[21] Y. Wang, P. Getreuer, T. Hughes, R. F. Lyon, and R. A. Saurous, “Trainable frontend for robust and far-field keyword spotting,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 5670–5674.
[22] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[23] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” Journal of Machine Learning Research, vol. 9, pp. 249–256, 2010.
[24] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.