
Accent Classification using Neural Networks

Collection Editor: Andrea Trevino

Authors: Scott Novich, Phil Repicky, Andrea Trevino

Online: <http://cnx.org/content/col10320/1.1/>

CONNEXIONS

    Rice University, Houston, Texas

This selection and arrangement of content as a collection is copyrighted by Andrea Trevino. It is licensed under the Creative Commons Attribution 2.0 license (http://creativecommons.org/licenses/by/2.0/).

    Collection structure revised: December 15, 2005

    PDF generated: October 25, 2012

For copyright and attribution information for the modules contained in this collection, see the Attributions section at the end.

Table of Contents

1 Introduction to Accent Classification with Neural Networks
2 Formants and Phonetics
3 Collection of Samples
4 Extracting Formants from Vowel Samples
5 Neural Network Design
6 Neural Network-based Accent Classification Results
7 Conclusions and References
Attributions


Chapter 1

Introduction to Accent Classification with Neural Networks

    1.1 Overview

Although seemingly subtle, accents have important influences in many areas, from business and sociology to technology, security, and intelligence. While much linguistic analysis has been done on the subject, very little work has been done on potential applications.

    1.1.1 Goals

The goal of this project is to generate a process for accurate accent detection. The algorithm developed should have the flexibility to choose how many accents to differentiate between. Currently, the algorithm is aimed at differentiating accents by language rather than by region, but it should be adaptable to the latter as well. Finally, the application should produce an output showing the relative strength of a speaker's primary accent compared to the others in the system.

    1.1.2 Design Choices

The agreed-upon option for achieving the desired flexibility in the project's algorithm is to use a neural network. A neural network is a matrix of weights describing how the parameters fed to the network tie the inputs to the outputs. Parameters of known inputs with corresponding outputs are fed to the network to train it. Training produces the weight matrix, to which test samples can then be fed. This provides a powerful and flexible tool for generating the desired algorithm.

Utilizing this approach limits the project group only by the number of samples collected to train the matrix and by how those samples are defined. For this project, approximately 800 samples from over 70 people were collected for training and testing. The group of language-based accents to test consists of American Northern English, American Texan English, Farsi, Russian, Mandarin, and Korean.

    1.1.3 Applications

Potential applications for this project are incredibly diverse. One example might be tagging information about a subject for intelligence purposes. The program could also serve as an aid or error check for voice-recognition-based systems, such as customer service or bioinformatics in security systems. The project could even help show a student's progress in learning a foreign language.


Chapter 2

Formants and Phonetics

Taking the FFT (Fast Fourier Transform) of each voice sample yields its frequency spectrum. A formant is one of the four highest peaks in a spectrum sample, and the main formants can be extracted from the frequency spectrum. It is the location of these formants along the frequency axis that defines a vowel sound. There are four main peaks between 300 and 4400 Hz; this band is where the strongest formants of human speech occur. For the purposes of this project, the group extracts the frequency values of only the first two peaks, since they provide the most information about which vowel sound was made. Because all vowels follow constant and recognizable patterns in these two formants, the changes associated with an accent can be recorded with a high degree of accuracy. Figure 2.1 shows this pattern between the vowel sounds and formant frequencies.
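
As a concrete illustration, the sketch below shows how such a spectrum could be inspected in MATLAB, the language used throughout this project. The filename and plotting details are our own assumptions, not the project's code.

    % Minimal sketch: view the 300-4400 Hz band of a vowel clip's spectrum.
    % 'vowel.wav' is a hypothetical filename; older MATLAB uses wavread.
    [x, fs] = audioread('vowel.wav');      % mono vowel recording
    X = abs(fft(x));                       % magnitude spectrum
    f = (0:length(x)-1)' * fs / length(x); % frequency axis in Hz
    band = f >= 300 & f <= 4400;           % band holding the main formants
    plot(f(band), X(band));
    xlabel('Frequency (Hz)'); ylabel('Magnitude');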


    The IPA Vowel Chart

    Figure 2.1

The first formant (F1) is dependent on whether a vowel sound is more open or closed, so on the chart F1 varies along the y-axis. F1 increases in frequency as the vowel becomes more open and decreases to its minimum as the vowel sound closes. The second formant (F2), however, varies along the x-axis: it depends on whether a sound is made in the front or the back of the vocal cavity. F2 increases in frequency the farther forward a vowel is made and decreases to its minimum as a vowel moves to the back. Therefore, each vowel sound has unique, characteristic values for its first two formants. With this in mind, the same frequency values for the first two formant locations should theoretically hold across many speakers, as long as they are making the same vowel sound.


2.1 Sample Spectrograms

    a as in call

Figure 2.2: The first and second formant have similar values

    i as in please

Figure 2.3: Very high F2 and small F1


Chapter 3

Collection of Samples

    3.1 Choosing the sample set

We decided that one sample for each of the English vowels on the IPA chart would be a fairly thorough representative sample of each individual's accent. With the inclusion of the two diphthongs that we also extracted, we took 14 vowel samples per person. We used the following paragraph in each recording; there are at least 4 instances of each vowel sound located throughout it.

    Figure 3.1

Phonetic Background (Please i, call a, Ste-ε, -lla, Ask æ, spoons u, five ai, brother ʌ, Bob α, Big I, toy oi, frog ν, go o, station e)

The vowels in bold are the ones we decided to extract; we determined that these would provide the cleanest formants in the whole paragraph. For example, the 'oo' in 'spoons' was chosen due to the 'p' that precedes the vowel. The 'p' sound creates a stop of air, almost like a vocal 'clear'. A 't' will also do this, which explains our choice of the 'a' in 'station'.


    Spoons

    Figure 3.2: The stop made by the 'p' is visible as a lack of formants

The two diphthongs present are the 'ai' from 'five' and 'oi' from 'toy'. In these samples, the formant values move smoothly from the first vowel representation to the second.

The vowel samples that we cut out of the main paragraph ended up being about 0.04 seconds each, with the diphthongs being much longer to capture the entire transition.
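
A minimal sketch of this cutting step, assuming each vowel's onset was marked by hand; the filename and start time are hypothetical.

    % Cut a 0.04 s vowel clip out of a full recording.
    [x, fs] = audioread('speaker01.wav');  % hypothetical recording
    t0 = 1.25;                             % hand-marked vowel onset (s)
    n = round(0.04 * fs);                  % 0.04 s worth of samples
    vowel = x(round(t0 * fs) + (1:n));     % clip passed on to Chapter 4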


Chapter 4

Extracting Formants from Vowel Samples

For each vowel sample we needed to extract the first and second formant frequencies. To do this we wrote a MATLAB function that we could quickly apply to each speaker's vowel samples. In an ideal world with clear speech this would be a straightforward process, since there would be two or more peaks on the frequency spectrum with little oscillation; the formants would simply be the locations of the first two peaks.


    Figure 4.1

However, very few of the samples are this clear. If the formants do not stay constant during the entire clip, then the formant peaks acquire smaller peaks on top of them. To solve this problem we did three things. First, we cut the samples into thirds, found the formants in each division, and then averaged the three values for a final formant value. Second, we ignored frequencies below 300 Hz, which correspond to the fundamental pitch of the voice rather than to formant content. Finally, we filtered our frequency-spectrum data to remove noise from the peaks. We also experimented with cubing the spectrum, but the second formant was generally small, and cubing the signal made it harder to find. As a guide for the accuracy of our answers we used the open-source application Praat, which can accurately find the formants using more advanced techniques.
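
The three steps might look roughly like the sketch below; the five-point smoothing width is our guess, not the authors' value, and the peak search itself is sketched further down.

    % Preprocess one vowel clip: thirds, 300 Hz cutoff, smoothing filter.
    third = floor(length(vowel) / 3);
    F1est = zeros(1, 3); F2est = zeros(1, 3);
    for k = 1:3
        seg = vowel((k-1)*third + (1:third));
        S = abs(fft(seg));
        f = (0:length(seg)-1)' * fs / length(seg);
        S(f < 300) = 0;                      % drop sub-300 Hz energy
        Ssm = conv(S, ones(5,1)/5, 'same');  % 5-point moving average
        % ... peak search on Ssm yields F1est(k), F2est(k) ...
    end
    F1 = mean(F1est); F2 = mean(F2est);      % average the three estimates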


    Figure 4.2

According to Praat, the first and second formants should be 569.7 Hz and 930.3 Hz. In the unfiltered spectrum there is a strong peak just above 300 Hz that does not correspond to a formant; in the filtered spectrum it is removed.

To locate the first formant we started by finding the maximum value in the spectrum. However, sometimes the second formant is stronger than the first, so we looked for another peak above a threshold (1.5 on our normalized scale) before this first guess. If no peak could be found before the maximum, the maximum was taken to be the first formant, and we then had to search for the second formant beyond it. We did this in the same manner as for the first, but we looked only at the part of the spectrum above the minimum immediately following the first peak. We found this minimum with the aid of the derivative.
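
A rough reconstruction of that search is sketched below; the threshold value and the local-extrema bookkeeping are our reading of the description, not the authors' exact function. It assumes the smoothed spectrum Ssm and frequency axis f from the previous sketch.

    % Find F1 and F2 as the first two qualifying peaks in Ssm.
    d = diff(Ssm);
    peaks = find(d(1:end-1) > 0 & d(2:end) <= 0) + 1;  % local maxima
    [~, iMax] = max(Ssm);
    thr = 0.5 * Ssm(iMax);                 % hypothetical threshold
    early = peaks(peaks < iMax & Ssm(peaks) > thr);
    if ~isempty(early)                     % the maximum was really F2
        i1 = early(1); i2 = iMax;
    else                                   % maximum is F1; search past the
        i1 = iMax;                         % minimum that follows it
        v = i1 + find(d(i1:end-1) < 0 & d(i1+1:end) >= 0, 1);
        [~, r] = max(Ssm(v:end));
        i2 = v + r - 1;
    end
    F1k = f(i1); F2k = f(i2);              % formant estimates, this segment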

This function was used on each vowel sample to generate an accent profile for each speaker. The profile consisted of the first and second formants of the speaker's 14 vowels, arranged in a column vector.


Chapter 5

Neural Network Design

To implement our neural network we used the Neural Network Toolbox in MATLAB. The neural network is built up of layers of neurons. Each neuron accepts a vector or scalar input (p) and gives a scalar output (a). The inputs are weighted by W and given a bias b, so the input becomes Wp + b. The neuron's transfer function operates on this value to generate the final scalar output a.

    A MATLAB Neuron that Accepts a Vector Input

    Figure 5.1
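
For concreteness, the computation in Figure 5.1 for a single neuron, with made-up weights and inputs; tansig is the transfer function this chapter eventually settles on.

    % One neuron: weight the input, add the bias, apply the transfer function.
    W = [0.2 -0.5 0.1];           % 1x3 weight row
    p = [1; 2; 3];                % 3x1 input vector
    b = 0.4;                      % scalar bias
    a = tansig(W*p + b);          % tansig(n) = 2/(1 + exp(-2n)) - 1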

Our network used three layers of neurons, one of which is required by the toolbox. The final layer, the output layer, is required to have as many neurons as the output has elements. We tested five accents, so our final layer has 5 neurons. We also added two "hidden" layers of 20 neurons each, which operate on the inputs before they are prepared as outputs.

In addition to configuring the network parameters, we had to build the network training set. Our training set contained 42 speakers: 8 Northern, 9 Texan, 9 Russian, 9 Farsi, and 7 Mandarin. An accent profile was created for each of these speakers as discussed and compacted into a matrix; each profile was a column vector, so the matrix was 28 x 42. For each speaker we also generated an answer vector; for example, the desired answer for a Texan accent is [0 1 0 0 0]. These answer vectors were likewise combined into an answer matrix. The training matrix and the desired answer matrix were given to the neural network, which trained using traingda (gradient descent with adaptive learning rate backpropagation). We set the goal for the training function to a mean square error of 0.005.
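
Assembling that data might look like the sketch below; profiles and accentClass are hypothetical variables holding Chapter 4's output and each speaker's labeled accent (1 = Northern, ..., 5 = Mandarin).

    % Build the 28x42 training matrix and the 5x42 answer matrix.
    P = zeros(28, 42);               % one 28-element profile per column
    T = zeros(5, 42);                % one indicator vector per column
    for s = 1:42
        P(:, s) = profiles{s};       % 14 vowels x 2 formants (Chapter 4)
        T(accentClass(s), s) = 1;    % e.g. Texan (2) -> [0 1 0 0 0]'
    end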

We originally configured our neural network to use neurons with a linear transfer function (purelin); however, when using more than three accents at a time we could not reduce the mean square error to 0.005. The error approached a limit, which increased as the number of accents we included increased.

    Linear Neuron Transfer Function

    Figure 5.2


    Linear Neurons Training Curve

    Figure 5.3

    So, at this point we redesigned our network to use non-linear neurons (tansig).
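
A minimal sketch of this final configuration in the toolbox API of that era (newff): the layer sizes, training function, and error goal are as stated above, while the epoch cap and the use of tansig on the output layer are our assumptions.

    % Two 20-neuron tansig hidden layers, a 5-neuron output layer, traingda.
    net = newff(minmax(P), [20 20 5], {'tansig','tansig','tansig'}, 'traingda');
    net.trainParam.goal = 0.005;   % stop at the stated mean square error
    net.trainParam.epochs = 5000;  % hypothetical cap on iterations
    net = train(net, P, T);        % adaptive-learning-rate backpropagation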


    Tansig Neuron Transfer Function

    Figure 5.4


    Tansig Neurons Training Curve

    Figure 5.5

After the network was trained, we refined our set of training samples by examining the network's output when given the training matrix again. We removed a handful of speakers, arriving at our present number of 42, because they exhibited an accent we weren't explicitly testing for: these were speakers who sounded as if they had learned British English rather than American English.

These final two figures show an image representation of the answer matrix and of the answers given by the trained network. In the images, gray is 0 and white is 1; colors darker than gray represent negative numbers.
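
One way to produce such a comparison (a sketch; the color scaling is our reading of the description):

    % Show the desired answers next to the trained network's answers.
    A = sim(net, P);                     % network output on the training set
    subplot(1,2,1); imagesc(T, [-1 1]); title('Answer Matrix');
    subplot(1,2,2); imagesc(A, [-1 1]); title('Trained Answers');
    colormap(gray);                      % gray = 0, white = 1, darker < 0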


    Answer Matrix

    Figure 5.6


    Trained Answers

    Figure 5.7


Chapter 6

Neural Network-based Accent Classification Results

    6.1 Results

The following are some example outputs from the neural network for various test speakers. The output displays the relative strengths of the different accents present in a particular subject. None of the test inputs were used in the training matrix. Overall, approximately 20 tests were conducted, with about an 80% success rate. Those that failed tended to fail for good reason: either inadequate recording quality, or speakers who did not provide accurate information about what their accent comprises (a common issue with subjects who have lived in multiple places).

The charts below show accents in the following order: Northern US, Texan US, Russian, Farsi, and Mandarin.
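
For reference, scoring one held-out speaker would look roughly like this; testProfile is a hypothetical 28-element column vector built exactly as in training.

    % Relative accent strengths for one test speaker.
    scores = sim(net, testProfile);      % 5x1 output vector
    bar(scores);
    set(gca, 'XTickLabel', {'Northern','Texan','Russian','Farsi','Mandarin'});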


    6.1.1 Test 1: Chinese Subject

    Chinese Subject

    Figure 6.1: Chinese Subject (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test1-china-northeast.mp3

Here our network has successfully picked out the accent of our subject. Secondarily, the network picked up on a slight Texan accent, possibly showing the influence of location on the subject (the sample was recorded in Texas).


    6.1.2 Test 2: Iranian Subject

    Iranian Subject

    Figure 6.2: Iranian Subject (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test2-farsi-11.mp3

Again our network has successfully picked out the accent of our subject. Once again, this sample was recorded in Texas, which could account for the secondary influence of a Texan accent in the subject.


    6.1.3 Test 3: Chinese Subject

    Chinese Subject

    Figure 6.3: Chinese Subject (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test3-mandarin-6.mp3

Once again, the network successfully picks up on the subject's primary accent, as well as the influence of a Texan accent (this sample was also recorded in Texas).


    6.1.4 Test 4: Chinese Subject

    Chinese Subject

    Figure 6.4: Chinese Subject (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test4-mandarin-10.mp3

A successful test showing little or no influence from other accents in the network.


    6.1.5 Test 5: American Subject (Hybrid of Regions)

    American Subject (Hybrid)

Figure 6.5: American Subject - Hybrid (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test5-amer-south-north-test7.mp3

Results from a subject who has lived all over, mainly in Texas, whose accent appears to sound more Northern (which seems relatively true if one listens to the source recording).


    6.1.6 Test 6: Russian Subject

    Russian Subject

    Figure 6.6: Russian Subject (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test6-russian-8.mp3

Successful test of a Russian subject with strong influences of a Northern US accent.


    6.1.7 Test 7: Russian Subject

    Russian Subject

    Figure 6.7: Russian Subject (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test7-russian11-test8.mp3

Another successful test of a Russian subject with strong influences of a Northern US accent.


    6.1.8 Test 8: Cantonese Subject

    Cantonese Subject

    Figure 6.8: Cantonese Subject (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test8-wai-lam.mp3

    Successful region-based test of a Cantonese subject who has been living in the US.


6.1.9 Test 9: Korean Subject

    Korean Subject

    Figure 6.9: Korean Subject (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test9-korean-3.mp3

An interesting example of throwing an accent at the network that doesn't fit into any of the categories.


Chapter 7

Conclusions and References

    7.1 Conclusions

Our results showed that vowel formant analysis provides accurate information about a person's overall speech and accent. However, the differences lie not in how speakers make the vowel sounds, but in which vowel sounds are made when speaking certain letter groupings. The results also showed that neural networks are a viable solution for generating an accent detection and classification algorithm. Because of the nature of neural networks, we can improve the performance of the system by feeding more training data into the network. We can also improve the performance of our system by using a better formant detector: one suggestion we received from the creator of Praat was to use pre-emphasis to make the formant peaks more obvious, even when one formant peak sits on another formant's slope.
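
Pre-emphasis is commonly implemented as a first-order high-pass filter applied to the raw samples before the FFT, as in the one-line sketch below; the 0.97 coefficient is a conventional choice, not a value from the text.

    % Pre-emphasis: tilt the spectrum up so higher formant peaks stand out.
    xPre = filter([1 -0.97], 1, x);      % x: raw samples from audioread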

    7.2 Acknowledgements

We would like to thank all the people who allowed us to record their voices, and Dr. Bill for providing us with a high-quality microphone. Secondly, we'd like to thank Andrew Harrison for giving our team a crash course in linguistics and phonetics. Finally, we'd especially like to thank caffeine, without which this project would never have been finished.

    7.3 References

Source: http://accent.gmu.edu
Source: http://en.wikipedia.org/wiki/Formant


    Figure 7.1


    Attributions

Collection: Accent Classification using Neural Networks
Edited by: Andrea Trevino
URL: http://cnx.org/content/col10320/1.1/
License: http://creativecommons.org/licenses/by/2.0/

Module: "Introduction to Accent Classification with Neural Networks"
By: Scott Novich
URL: http://cnx.org/content/m13214/1.1/
Copyright: Scott Novich
License: http://creativecommons.org/licenses/by/2.0/

Module: "Formants and Phonetics"
By: Andrea Trevino
URL: http://cnx.org/content/m13209/1.2/
Copyright: Andrea Trevino
License: http://creativecommons.org/licenses/by/2.0/

Module: "Collection of Samples"
By: Andrea Trevino
URL: http://cnx.org/content/m13208/1.3/
Copyright: Andrea Trevino
License: http://creativecommons.org/licenses/by/2.0/

Module: "Extracting Formants from Vowel Samples"
By: Phil Repicky
URL: http://cnx.org/content/m13216/1.1/
Copyright: Phil Repicky
License: http://creativecommons.org/licenses/by/2.0/

Module: "Neural Network Design"
By: Phil Repicky
URL: http://cnx.org/content/m13228/1.1/
Copyright: Phil Repicky
License: http://creativecommons.org/licenses/by/2.0/

Module: "Neural Network-based Accent Classification Results"
By: Scott Novich
URL: http://cnx.org/content/m13231/1.1/
Copyright: Scott Novich
License: http://creativecommons.org/licenses/by/2.0/

Module: "Conclusions and References"
By: Phil Repicky
URL: http://cnx.org/content/m13210/1.1/
Copyright: Phil Repicky
License: http://creativecommons.org/licenses/by/2.0/

Accent Classification using Neural Networks

Creating an algorithm for detecting and classifying accents using formant analysis and neural networks.

About Connexions

Since 1999, Connexions has been pioneering a global system where anyone can create course materials and make them fully accessible and easily reusable free of charge. We are a Web-based authoring, teaching and learning environment open to anyone interested in education, including students, teachers, professors and lifelong learners. We connect ideas and facilitate educational communities.

Connexions's modular, interactive courses are in use worldwide by universities, community colleges, K-12 schools, distance learners, and lifelong learners. Connexions materials are available in many languages, including English, Spanish, Chinese, Japanese, Italian, Vietnamese, French, Portuguese, and Thai. Connexions is part of an exciting new information distribution system that allows for Print on Demand Books. Connexions has partnered with innovative on-demand publisher QOOP to accelerate the delivery of printed course materials and textbooks into classrooms worldwide at lower prices than traditional academic publishers.