Study of Word-Level Accent Classi cation and Gender Factorsstudents.cse.tamu.edu/xingwang/courses/csce666_accent_native_indian.pdf · 2 / Project Report :CSE666 (2013) 2 Related Work

Project Report :CSE666 (2013)

Study of Word-Level Accent Classification and Gender

Factors

Xing Wang, Peihong Guo, Tian Lan, Guoyu Fu,{wangxing.pku, peihongguo, welkinlan, fgy108}@gmail.com

Department of Computer Science and Engineering, Texas A&M University

Abstract

In this work, we conduct word-level accent classification. Different features, words, and learning methodsare explored for accent classification, and results show that HMM-MFCC models show promising performance.Besides, we also explore the effect of gender on accent classification. Results show that models trainedon male data do not generalized well on female data; Models trained on both male and female data isnot always better than models trained on female data only. At last, we propose to use stacked ensembleclassifier to classify gender firstly and then classify accent to improve accuracy.

Keywords: Speech Processing; Accent; HMM; GMM; MFCC; Formants

1 Introduction

Speaker and speech recognition is faced with bottleneck caused by speaker variability, in whichaccent is one of the most important factors [1] [2]. Accent is the way that particular personor group of people sound 1. Identifying accent prior to automatic speech/speaker recognitionproduces a reduced search space and lower language modeling perplexity [3].

In our work, we extracted MFCC and formant features from words. Models are built usingtime-series HMM method and non time-series GMM method. Results show that HMM withMFCC features is the optimal combination. The most discriminant words are in accordance withhuman perception. Besides, we also tested how gender affects the accent recognition. Modelsbuilt on one gender do not generalize well on samples of the other gender. We verified that atwo-layer classifier by classifying the gender firstly can help improve the accuracy of the wholesystem.

The remainder of this paper is structured as follows. Section 2 provides an overview of previousworks on accent classification. Section 3 describes our experiment design. Section 4 presentsresults and analysis. The article concludes with a discussion of results and future directions.

1http://dialectblog.com/2011/01/28/dialectvsaccent/

2 / Project Report :CSE666 (2013)

2 Related Work

Accent classification has been studied on serval levels: phoneme level [4] , word-level [2] [5] ,and whole audio level [6]. Features extracted from the whole audio level contain too much noise.And the difference between classes is not strong enough to build accent classifier. Based on thedescription of India accented English 2, we find that the essence of accent is miss-pronouncing onephoneme as another. For example, Indians prefer to pronounce /au/ as /a:/, and voiceless plosives/p/ are always unaspirated. There are few studies working on word-level accent recognition. Tanget. al [5] extracted prosody features. Levent et. al [2] extracted formants features on word level.We extended Levent’s work to try MFCC features and test gender’s effect on the system.

After we decided the granularity of the audio excerpt, the next issue concerned is the featureused to describe every audio frame. Prior researches constructed accent recognition system withspectrum information as their features. One of the most frequently used features is the Mel-frequency Cepstral Coefficients (MFCCs) [2][7][8][9]. Another kind of spectral features is formantstructure [2][6], especially the first and second formant frequencies (as Arslan and Hansen sug-gested in [10]). Liu et al. also stated that if ignoring the fundamental frequency F0, the accentclassification accuracy receives a slight degradation [11]. In our work, comparison of these threekinds of features is performed.

In addition to feature extraction, another popular topic is choice of the learning models. HiddenMarkov Model (HMM) is often utilized to model and classify temporal patterns [2][5]. HMMrecognizer has an advantage in capturing the variation of spectral features. When the statenumber of HMM is reduced to one, the Gaussian Mixture Model (GMM) will serve as the actualclassifier. The strength of GMM over HMM lies in that GMM is less time-consuming and thatGMM does not needs transcripts [7]. We will make a comparison of these two classifiers in thiswork.

3 Experiment Design

3.1 Framework

The GMU Speech Accent Archive [6] is selected as data set. The speech was sampled and digitized.Then the audio is align with word transcript. We then used mel-frequency cepstrum and formantinformation to serve as feature vectors of each word. On the classifier level, we utilized GMM anHMM for comparing their performances. The framework of our system is shown in figure 1.

The optimal feature and classifier were noted from experiments on American and Indian speak-ers. Then the optimal combination of feature and classifier was implemented for experiments ofthe gender factor.

3.2 Data Preparation

In the corpus, speeches were collected in a quiet setting. It records speeches of one same paragraphthat contains most of consonants and vowels. Basic information of every speaker, like gender, age,

2http://en.wikipedia.org/wiki/Indian English

/ Project Report :CSE666 (2013) 3

Fig. 1: System overview

Male Female

American 141 169

India 38 25

Table 1: Accent and gender distribution

etc., is provided and crawled from the website. The American-Indian data set we used is generatedby restricting the nationality and the native language. The accent and gender distribution of thedata set is shown in Table 1, and this dataset is unbalanced.

3.3 Word Alignment

Since the transcript of each speakers speech is available, we utilized the tool [12] to align it withthe speech and segment the speech word by word, after digitizing the speech. The output formatof the toolkit is in Textgrid format, which can be processed by python script and can also beloaded in Praat as shown in Fig 2.

3.4 Feature Extraction

HTK MFCC package 3 and colea package 4 are used to extract MFCC and formant features.For each word, we have three types of features: MFCCs (trough c0 to c12), formant f0-f1-f2 andformant f1-f2

3http://www.mathworks.com/matlabcentral/fileexchange/32849-htk-mfcc-matlab4http://www.mathworks.com/matlabcentral/fileexchange/108-colea


Fig. 2: Alignment Results

3.5 Machine Learning Method

GMM and HMM are adopted to train our classifiers. We conducted experiments 5 times and theaverage accuracy is reported.

The parameters for the model are obtained through 5-fold cross validation on the training dataset. In each validation experiment, every combination of different parameters was implemented,and the combination of parameters that produced the maximal accuracy was selected as the oneto train the classifier.

During the EM training process, the number of GMM components for each class is viewed asa prior. However, this parameter M may vary from circumstance to circumstance. In [6], 12 and13 were used to model American accented speech and Indian accented speech, respectively. In[8], M=32 was set. Considering our classification is based on word-level, it is necessary to lowerthe range of this number. In our experiment, this range is set as M ∈ {1, 2, 3} for both classes.

The HMM classifier we used is based on a continuous HMM with single Gaussian componentfor each hidden state. The observation sequences are the set of feature vectors extracted fromeach test word, where each feature vector is a single observation. The models are trained usingEM algorithm with random restart. The number of hidden states used in the model are chosenfrom [5,9,13] which is determined by cross validation. Generally, we expect more accurate resultswith more hidden states, as well as longer training time.

4 Results and analysis

In this section, we will report our results on features comparison, machine learning methodscomparison, words comparison and check the effect of gender on accent classification.

4.1 Features Comparison

The comparison of features is shown in Fig 4. The HMM was fixed as the classifier. MFCCs,F0F1F2 and F1F2 are extracted from each frame of every word in these three experiments,respectively. For systems of each pair, the training-test process was repeated 5 times. Theaverage of these 5 accuracies was collected and drawn in Fig 4 in Appendix. As mentioned above,before training process, 5-fold cross validation was performed to select the optimal parametersfor HMM.


(a) CAN (b) ASK (c) WILL (d) SCOOP

Fig. 3: Trajectory of discriminant words in C1-C2 of MFCC, blue lines represent American while redlines represent Indian

In this figure, we can find that: the MFCCs-HMM recognition system receives a highest ac-curacy, which states that MFCCs are more adequate features than formants; accuracy withfundamental frequency F0 is slightly better than that without F0; for classification consistencyacross words, accuracy of formants varies in a range of 15-20 percents from word to word, whilethis range of MFCCs based model is less than 10 percents. The feature of MFCCs contributes tosuch a high accuracy that we will consider it as the optimal feature extraction approach in thefollowing experiments.

4.2 Machine Learning Methods Comparison

The comparison of temporal and non-temporal machine learning methods is shown in Fig 5 inAppendix. MFCC is fixed as the approach to extract features from frames of all words. After thesame process with the previous experiment, the accuracies of GMM and HMM of 5 experimentswere obtained. In this figure, both the accuracy and its stability across words of GMM are lowerthan those of HMM, for most of words; performance of GMM are worse than HMM, which provesthat even on the word level, better results can be achieved if capturing the temporal variation offrames.

4.3 Words Comparison

The accuracy of each pair of recognition system varies from word to word. Experiments werecarried out to discover that relation between performance of classification system and the words’characteristics. The result is shown in Fig 6 in Appendix. For word ’OF’, most Indians do notdifferentiate between /v/ and /w/. For word ’TO’, the /t/ sounds like /d/ by Indians. For word’BRING’ and ’BROTHER’, the /r/ is a rolling r. For other word will good discriminant ability,we can find that C1-C2 for these words occupy different space as shown in Fig 3 .

4.4 Gender Effects

The comparison of machine learning methods is shown in Fig 7 in Appendix. Firstly, we canfind that models trained using male are not as good as models trained using female data, andless stable. Secondly,if we use both male and female training data to train the model, we do notget consistent better result than models training using female data only. Finally, if we use a twolayer classifier to classify the gender first and classify accent using model trained using data ofthe same gender, we can get better results most of the time.


5 Conclusion

In this work, we conduct word-level accent classification. Results show that HMM-MFCC modelsshow promising performance. Model trained on one gender do not generalized well on the othergender. Stacked ensemble classifier can help alleviate this problem by classifying gender firstlyand then accent.

6 Acknowledgement

The authors would like to appreciate Dr. Gutierrez-Osuna, Mr. SandeshAryal and Mr. JosephSung Lee for their valuable suggestions and continuous helps.

References

[1] Chao Huang, Tao Chen, Stan Z. Li, Eric Chang, and Jian-Lai Zhou, “Analysis of speaker variabil-ity.,” in INTERSPEECH, Paul Dalsgaard, Brge Lindberg, Henrik Benner, and Zheng-Hua Tan,Eds. 2001, pp. 1377–1380, ISCA.

[2] Levent M. Arslan, John H. L. Hansen, Prof John, and H. L. Hansen, “Language accent classificationin american english,” 1996.

[3] Fadi Biadsy, Hagen Soltau, Lidia Mangu, and Jiri Navratil Julia, “Discriminative phonotacticsfor dialect recognition using context-dependent phone classifiers,” 2010.

[4] F. William, A. Sangwan, and J.H.L. Hansen, “Automatic accent assessment using phonetic mis-match and human perception,” Audio, Speech, and Language Processing, IEEE Transactions on,vol. 21, no. 9, pp. 1818–1829, 2013.

[5] Hong Tang and Ali A. Ghorbani, “Accent classification using support vector machine and hid-den markov model,” in Proceedings of the 16th Canadian Society for Computational Studies ofIntelligence Conference on Advances in Artificial Intelligence, Berlin, Heidelberg, 2003, AI’03, pp.629–631, Springer-Verlag.

[6] S. Deshpande, S. Chikkerur, and V. Govindaraju, “Accent classification in speech,” in AutomaticIdentification Advanced Technologies, 2005. Fourth IEEE Workshop on, 2005, pp. 139–143.

[7] Tao Chen, Chao Huang, E. Chang, and Jingchun Wang, “Automatic accent identification usinggaussian mixture models,” in Automatic Speech Recognition and Understanding, 2001. ASRU ’01.IEEE Workshop on, 2001, pp. 343–346.

[8] Phuoc Nguyen, D. Tran, Xu Huang, and D. Sharma, “Automatic classification of speaker charac-teristics,” in Communications and Electronics (ICCE), 2010 Third International Conference on,2010, pp. 147–152.

[9] M.A. Zissman, “Comparison of four approaches to automatic language identification of telephonespeech,” Speech and Audio Processing, IEEE Transactions on, vol. 4, no. 1, pp. 31–, 1996.

[10] L.M. Arslan and J.H.L. Hansen, “Frequency characteristics of foreign accented speech,” in Acous-tics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on,1997, vol. 2, pp. 1123–1126 vol.2.

[11] Liu Wai Kat and P. Fung, “Fast accent identification and accented speech recognition,” inAcoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conferenceon, 1999, vol. 1, pp. 221–224 vol.1.

[12] Jiahong Yuan and Mark Liberman, “Speaker identification on the scotus corpus,” in In Proceedingsof Acoustics 2008, 2008.


7 Appendix

Fig. 4: Comparison of features using HMM training method


Fig. 5: Comparison of machine learning methods using MFCC features


Fig. 6: Rank of accuracy across words for HMM-MFCC


Fig. 7: Gender effects on classification.

Documents

Study of Word-Level Accent Classi cation and Gender Factorsstudents.cse.tamu.edu/xingwang/courses/csce666_accent_native_indian.pdf · 2 / Project Report :CSE666 (2013) 2 Related Work