
Accent Classification using Neural Networks

Collection Editor: Andrea Trevino

Authors: Scott Novich, Phil Repicky, Andrea Trevino

Online: <http://cnx.org/content/col10320/1.1/>

CONNEXIONS

    Rice University, Houston, Texas

This selection and arrangement of content as a collection is copyrighted by Andrea Trevino. It is licensed under the Creative Commons Attribution 2.0 license (http://creativecommons.org/licenses/by/2.0/).

    Collection structure revised: December 15, 2005

    PDF generated: October 25, 2012

For copyright and attribution information for the modules contained in this collection, see the Attributions section at the end.

Table of Contents

1 Introduction to Accent Classification with Neural Networks
2 Formants and Phonetics
3 Collection of Samples
4 Extracting Formants from Vowel Samples
5 Neural Network Design
6 Neural Network-based Accent Classification Results
7 Conclusions and References
Attributions


Chapter 1

Introduction to Accent Classification with Neural Networks

    1.1 Overview

Although seemingly subtle, accents have important influences in many areas, from business and sociology to technology, security, and intelligence. While much linguistic analysis has been done on the subject, very little work has been done on potential applications.

    1.1.1 Goals

The goal of this project is to generate a process for accurate accent detection. The algorithm developed should have the flexibility to choose how many accents to differentiate between. Currently, the algorithm is aimed at differentiating accents by language rather than by region, but it should be adaptable to the latter as well. Finally, the application should produce an output showing the relative strength of a speaker's primary accent compared to the others in the system.

    1.1.2 Design Choices

The agreed-upon option for achieving the desired flexibility in the project's algorithm is to use a neural network. A neural network is a matrix of weights describing how the parameters fed to the network tie the inputs to the outputs. Parameters of known inputs with corresponding outputs are fed to the network to train it. Training produces the weight matrix, to which test samples can then be fed. This provides a powerful and flexible tool for generating the desired algorithm.

Utilizing this approach limits the project group only by the number of samples collected to train the matrix and by how those samples are defined. For this project, approximately 800 samples from over 70 people were collected for training and testing. The group of language-based accents to test consists of American Northern English, American Texan English, Farsi, Russian, Mandarin, and Korean.

    1.1.3 Applications

Potential applications for this project are incredibly diverse. One example might be tagging information about a subject for intelligence purposes. The program could also serve as an aid or error check for voice-recognition-based systems, such as customer service or bioinformatics in security systems. The project could even help show a student's progress in learning a foreign language.


Chapter 2

Formants and Phonetics

Taking the FFT (Fast Fourier Transform) of each voice sample yields its frequency spectrum. A formant is one of the four highest peaks in a spectrum sample, and the main formants can be extracted from the frequency spectrum. It is the location of these formants along the frequency axis that defines a vowel sound. There are four main peaks between 300 and 4400 Hz; this band is where the strongest formants of human speech occur. For the purposes of this project, the group extracts the frequency values of only the first two peaks, since they provide the most information about which vowel sound was made. Because all vowels follow constant and recognizable patterns in these two formants, the changes associated with an accent can be recorded with a high degree of accuracy. Figure 2.1 shows this pattern between the vowel sounds and formant frequencies.
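
As a concrete illustration, the sketch below shows how such a spectrum could be inspected in MATLAB, the language used throughout this project. The filename and plotting details are our own assumptions, not the project's code.

    % Minimal sketch: view the 300-4400 Hz band of a vowel clip's spectrum.
    % 'vowel.wav' is a hypothetical filename; older MATLAB uses wavread.
    [x, fs] = audioread('vowel.wav');      % mono vowel recording
    X = abs(fft(x));                       % magnitude spectrum
    f = (0:length(x)-1)' * fs / length(x); % frequency axis in Hz
    band = f >= 300 & f <= 4400;           % band holding the main formants
    plot(f(band), X(band));
    xlabel('Frequency (Hz)'); ylabel('Magnitude');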


    The IPA Vowel Chart

    Figure 2.1

The first formant (F1) is dependent on whether a vowel sound is more open or closed, so on the chart F1 varies along the y-axis. F1 increases in frequency as the vowel becomes more open and decreases to its minimum as the vowel sound closes. The second formant (F2), however, varies along the x-axis: it depends on whether a sound is made in the front or the back of the vocal cavity. F2 increases in frequency the farther forward a vowel is made and decreases to its minimum as a vowel moves to the back. Therefore, each vowel sound has unique, characteristic values for its first two formants. With this in mind, the same frequency values for the first two formant locations should theoretically hold across many speakers, as long as they are making the same vowel sound.


2.1 Sample Spectrograms

    a as in call

Figure 2.2: The first and second formant have similar values

    i as in please

Figure 2.3: Very high F2 and small F1


Chapter 3

Collection of Samples

    3.1 Choosing the sample set

We decided that one sample for each of the English vowels on the IPA chart would be a fairly thorough representative sample of each individual's accent. With the inclusion of the two diphthongs that we also extracted, we took 14 vowel samples per person. We used the following paragraph in each recording; there are at least 4 instances of each vowel sound located throughout it.

    Figure 3.1

Phonetic Background (Please i, call a, Ste-ε, -lla, Ask æ, spoons u, five ai, brother ʌ, Bob α, Big I, toy oi, frog ν, go o, station e)

The vowels in bold are the ones we decided to extract; we determined that these would provide the cleanest formants in the whole paragraph. For example, the 'oo' in 'spoons' was chosen due to the 'p' that precedes the vowel. The 'p' sound creates a stop of air, almost like a vocal 'clear'. A 't' will also do this, which explains our choice of the 'a' in 'station'.


    Spoons

    Figure 3.2: The stop made by the 'p' is visible as a lack of formants

The two diphthongs present are the 'ai' from 'five' and 'oi' from 'toy'. In these samples, the formant values move smoothly from the first vowel representation to the second.

The vowel samples that we cut out of the main paragraph ended up being about 0.04 seconds each, with the diphthongs being much longer to capture the entire transition.
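
A minimal sketch of this cutting step, assuming each vowel's onset was marked by hand; the filename and start time are hypothetical.

    % Cut a 0.04 s vowel clip out of a full recording.
    [x, fs] = audioread('speaker01.wav');  % hypothetical recording
    t0 = 1.25;                             % hand-marked vowel onset (s)
    n = round(0.04 * fs);                  % 0.04 s worth of samples
    vowel = x(round(t0 * fs) + (1:n));     % clip passed on to Chapter 4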


Chapter 4

Extracting Formants from Vowel Samples

For each vowel sample we needed to extract the first and second formant frequencies. To do this we wrote a MATLAB function that we could quickly apply to each speaker's vowel samples. In an ideal world with clear speech this would be a straightforward process, since there would be two or more peaks on the frequency spectrum with little oscillation; the formants would simply be the locations of the first two peaks.


    Figure 4.1

However, very few of the samples are this clear. If the formants do not stay constant during the entire clip, then the formant peaks acquire smaller peaks on top of them. To solve this problem we did three things. First, we cut the samples into thirds, found the formants in each division, and then averaged the three values for a final formant value. Second, we ignored frequencies below 300 Hz, which correspond to the fundamental pitch of the voice rather than to formant content. Finally, we filtered our frequency-spectrum data to remove noise from the peaks. We also experimented with cubing the spectrum, but the second formant was generally small, and cubing the signal made it harder to find. As a guide for the accuracy of our answers we used the open-source application Praat, which can accurately find the formants using more advanced techniques.
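
The three steps might look roughly like the sketch below; the five-point smoothing width is our guess, not the authors' value, and the peak search itself is sketched further down.

    % Preprocess one vowel clip: thirds, 300 Hz cutoff, smoothing filter.
    third = floor(length(vowel) / 3);
    F1est = zeros(1, 3); F2est = zeros(1, 3);
    for k = 1:3
        seg = vowel((k-1)*third + (1:third));
        S = abs(fft(seg));
        f = (0:length(seg)-1)' * fs / length(seg);
        S(f < 300) = 0;                      % drop sub-300 Hz energy
        Ssm = conv(S, ones(5,1)/5, 'same');  % 5-point moving average
        % ... peak search on Ssm yields F1est(k), F2est(k) ...
    end
    F1 = mean(F1est); F2 = mean(F2est);      % average the three estimates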


    Figure 4.2

According to Praat, the first and second formants should be 569.7 Hz and 930.3 Hz. In the unfiltered spectrum there is a strong peak just above 300 Hz that does not correspond to a formant; in the filtered spectrum it is removed.

To locate the first formant we started by finding the maximum value in the spectrum. However, sometimes the second formant is stronger than the first, so we looked for another peak above a threshold (1.5 on our normalized scale) before this first guess. If no peak could be found before the maximum, the maximum was taken to be the first formant, and we then had to search for the second formant beyond it. We did this in the same manner as for the first, but we looked only at the part of the spectrum above the minimum immediately following the first peak. We found this minimum with the aid of the derivative.
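
A rough reconstruction of that search is sketched below; the threshold value and the local-extrema bookkeeping are our reading of the description, not the authors' exact function. It assumes the smoothed spectrum Ssm and frequency axis f from the previous sketch.

    % Find F1 and F2 as the first two qualifying peaks in Ssm.
    d = diff(Ssm);
    peaks = find(d(1:end-1) > 0 & d(2:end) <= 0) + 1;  % local maxima
    [~, iMax] = max(Ssm);
    thr = 0.5 * Ssm(iMax);                 % hypothetical threshold
    early = peaks(peaks < iMax & Ssm(peaks) > thr);
    if ~isempty(early)                     % the maximum was really F2
        i1 = early(1); i2 = iMax;
    else                                   % maximum is F1; search past the
        i1 = iMax;                         % minimum that follows it
        v = i1 + find(d(i1:end-1) < 0 & d(i1+1:end) >= 0, 1);
        [~, r] = max(Ssm(v:end));
        i2 = v + r - 1;
    end
    F1k = f(i1); F2k = f(i2);              % formant estimates, this segment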

This function was used on each vowel sample to generate an accent profile for each speaker. The profile consisted of the first and second formants of the speaker's 14 vowels, arranged in a column vector.


Chapter 5

Neural Network Design

To implement our neural network we used the Neural Network Toolbox in MATLAB. The neural network is built up of layers of neurons. Each neuron accepts a vector or scalar input (p) and gives a scalar output (a). The inputs are weighted by W and given a bias b, so the input becomes Wp + b. The neuron's transfer function operates on this value to generate the final scalar output a.

    A MATLAB Neuron that Accepts a Vector Input

    Figure 5.1
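
For concreteness, the computation in Figure 5.1 for a single neuron, with made-up weights and inputs; tansig is the transfer function this chapter eventually settles on.

    % One neuron: weight the input, add the bias, apply the transfer function.
    W = [0.2 -0.5 0.1];           % 1x3 weight row
    p = [1; 2; 3];                % 3x1 input vector
    b = 0.4;                      % scalar bias
    a = tansig(W*p + b);          % tansig(n) = 2/(1 + exp(-2n)) - 1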

Our network used three layers of neurons, one of which is required by the toolbox. The final layer, the output layer, is required to have as many neurons as the output has elements. We tested five accents, so our final layer has 5 neurons. We also added two "hidden" layers of 20 neurons each, which operate on the inputs before they are prepared as outputs.

In addition to configuring the network parameters, we had to build the network training set. Our training set contained 42 speakers: 8 Northern, 9 Texan, 9 Russian, 9 Farsi, and 7 Mandarin. An accent profile was created for each of these speakers as discussed and compacted into a matrix; each profile was a column vector, so the matrix was 28 x 42. For each speaker we also generated an answer vector; for example, the desired answer for a Texan accent is [0 1 0 0 0]. These answer vectors were likewise combined into an answer matrix. The training matrix and the desired answer matrix were given to the neural network, which trained using traingda (gradient descent with adaptive learning rate backpropagation). We set the goal for the training function to a mean square error of 0.005.
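
Assembling that data might look like the sketch below; profiles and accentClass are hypothetical variables holding Chapter 4's output and each speaker's labeled accent (1 = Northern, ..., 5 = Mandarin).

    % Build the 28x42 training matrix and the 5x42 answer matrix.
    P = zeros(28, 42);               % one 28-element profile per column
    T = zeros(5, 42);                % one indicator vector per column
    for s = 1:42
        P(:, s) = profiles{s};       % 14 vowels x 2 formants (Chapter 4)
        T(accentClass(s), s) = 1;    % e.g. Texan (2) -> [0 1 0 0 0]'
    end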

We originally configured our neural network to use neurons with a linear transfer function (purelin); however, when using more than three accents at a time we could not reduce the mean square error to 0.005. The error approached a limit, which increased as the number of accents we included increased.

    Linear Neuron Transfer Function

    Figure 5.2


    Linear Neurons Training Curve

    Figure 5.3

    So, at this point we redesigned our network to use non-linear neurons (tansig).
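
A minimal sketch of this final configuration in the toolbox API of that era (newff): the layer sizes, training function, and error goal are as stated above, while the epoch cap and the use of tansig on the output layer are our assumptions.

    % Two 20-neuron tansig hidden layers, a 5-neuron output layer, traingda.
    net = newff(minmax(P), [20 20 5], {'tansig','tansig','tansig'}, 'traingda');
    net.trainParam.goal = 0.005;   % stop at the stated mean square error
    net.trainParam.epochs = 5000;  % hypothetical cap on iterations
    net = train(net, P, T);        % adaptive-learning-rate backpropagation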


    Tansig Neuron Transfer Function

    Figure 5.4


    Tansig Neurons Training Curve

    Figure 5.5

After the network was trained, we refined our set of training samples by examining the network's output when given the training matrix again. We removed a handful of speakers, arriving at our present number of 42, because they exhibited an accent we weren't explicitly testing for: these were speakers who sounded as if they had learned British English rather than American English.

These final two figures show an image representation of the answer matrix and of the answers given by the trained network. In the images, gray is 0 and white is 1; colors darker than gray represent negative numbers.
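
One way to produce such a comparison (a sketch; the color scaling is our reading of the description):

    % Show the desired answers next to the trained network's answers.
    A = sim(net, P);                     % network output on the training set
    subplot(1,2,1); imagesc(T, [-1 1]); title('Answer Matrix');
    subplot(1,2,2); imagesc(A, [-1 1]); title('Trained Answers');
    colormap(gray);                      % gray = 0, white = 1, darker < 0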


    Answer Matrix

    Figure 5.6


    Trained Answers

    Figure 5.7


Chapter 6

Neural Network-based Accent Classification Results

    6.1 Results

The following are some example outputs from the neural network for various test speakers. The output displays the relative strengths of the different accents present in a particular subject. None of the test inputs were used in the training matrix. Overall, approximately 20 tests were conducted, with about an 80% success rate. Those that failed tended to fail for good reason: either inadequate recording quality, or speakers who did not provide accurate information about what their accent comprises (a common issue with subjects who have lived in multiple places).

The charts below show accents in the following order: Northern US, Texan US, Russian, Farsi, and Mandarin.
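
For reference, scoring one held-out speaker would look roughly like this; testProfile is a hypothetical 28-element column vector built exactly as in training.

    % Relative accent strengths for one test speaker.
    scores = sim(net, testProfile);      % 5x1 output vector
    bar(scores);
    set(gca, 'XTickLabel', {'Northern','Texan','Russian','Farsi','Mandarin'});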


    6.1.1 Test 1: Chinese Subject

    Chinese Subject

    Figure 6.1: Chinese Subject (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test1-china-northeast.mp3

Here our network has successfully picked out the accent of our subject. Secondarily, the network picked up on a slight Texan accent, possibly showing the influence of location on the subject (the sample was recorded in Texas).


    6.1.2 Test 2: Iranian Subject

    Iranian Subject

    Figure 6.2: Iranian Subject (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test2-farsi-11.mp3

Again our network has successfully picked out the accent of our subject. Once again, this sample was recorded in Texas, which could account for the secondary influence of a Texan accent in the subject.


    6.1.3 Test 3: Chinese Subject

    Chinese Subject

    Figure 6.3: Chinese Subject (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test3-mandarin-6.mp3

Once again, the network successfully picks up on the subject's primary accent, as well as the influence of a Texan accent (this sample was also recorded in Texas).


    6.1.4 Test 4: Chinese Subject

    Chinese Subject

    Figure 6.4: Chinese Subject (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test4-mandarin-10.mp3

A successful test showing little or no influence from other accents in the network.


    6.1.5 Test 5: American Subject (Hybrid of Regions)

    American Subject (Hybrid)

Figure 6.5: American Subject - Hybrid (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test5-amer-south-north-test7.mp3

Results from a subject who has lived all over, mainly in Texas, whose accent appears to sound more Northern (which seems relatively true if one listens to the source recording).


    6.1.6 Test 6: Russian Subject

    Russian Subject

    Figure 6.6: Russian Subject (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test6-russian-8.mp3

Successful test of a Russian subject with strong influences of a Northern US accent.


    6.1.7 Test 7: Russian Subject

    Russian Subject

    Figure 6.7: Russian Subject (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test7-russian11-test8.mp3

Another successful test of a Russian subject with strong influences of a Northern US accent.


    6.1.8 Test 8: Cantonese Subject

    Cantonese Subject

    Figure 6.8: Cantonese Subject (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test8-wai-lam.mp3

    Successful region-based test of a Cantonese subject who has been living in the US.


6.1.9 Test 9: Korean Subject

    Korean Subject

    Figure 6.9: Korean Subject (accent order: Northern US, Texan US, Russian, Farsi, and Mandarin)

Audio sample: http://cnx.org/content/m13231/latest/test9-korean-3.mp3

An interesting example of throwing an accent at the network that doesn't fit into any of the categories.


Chapter 7

Conclusions and References

    7.1 Conclusions

Our results showed that vowel formant analysis provides accurate information about a person's overall speech and accent. However, the differences lie not in how speakers make the vowel sounds, but in which vowel sounds are made when speaking certain letter groupings. The results also showed that neural networks are a viable solution for generating an accent detection and classification algorithm. Because of the nature of neural networks, we can improve the performance of the system by feeding more training data into the network. We can also improve the performance of our system by using a better formant detector: one suggestion we received from the creator of Praat was to use pre-emphasis to make the formant peaks more obvious, even when one formant peak sits on another formant's slope.
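
Pre-emphasis is commonly implemented as a first-order high-pass filter applied to the raw samples before the FFT, as in the one-line sketch below; the 0.97 coefficient is a conventional choice, not a value from the text.

    % Pre-emphasis: tilt the spectrum up so higher formant peaks stand out.
    xPre = filter([1 -0.97], 1, x);      % x: raw samples from audioread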

    7.2 Acknowledgements

We would like to thank all the people who allowed us to record their voices, and Dr. Bill for providing us with a high-quality microphone. Secondly, we'd like to thank Andrew Harrison for giving our team a crash course in linguistics and phonetics. Finally, we'd especially like to thank caffeine, without which this project would never have been finished.

    7.3 References

Source: http://accent.gmu.edu
Source: http://en.wikipedia.org/wiki/Formant


    Figure 7.1


    Attributions

Collection: Accent Classification using Neural Networks
Edited by: Andrea Trevino
URL: http://cnx.org/content/col10320/1.1/
License: http://creativecommons.org/licenses/by/2.0/

Module: "Introduction to Accent Classification with Neural Networks"
By: Scott Novich
URL: http://cnx.org/content/m13214/1.1/
Copyright: Scott Novich
License: http://creativecommons.org/licenses/by/2.0/

Module: "Formants and Phonetics"
By: Andrea Trevino
URL: http://cnx.org/content/m13209/1.2/
Copyright: Andrea Trevino
License: http://creativecommons.org/licenses/by/2.0/

Module: "Collection of Samples"
By: Andrea Trevino
URL: http://cnx.org/content/m13208/1.3/
Copyright: Andrea Trevino
License: http://creativecommons.org/licenses/by/2.0/

Module: "Extracting Formants from Vowel Samples"
By: Phil Repicky
URL: http://cnx.org/content/m13216/1.1/
Copyright: Phil Repicky
License: http://creativecommons.org/licenses/by/2.0/

Module: "Neural Network Design"
By: Phil Repicky
URL: http://cnx.org/content/m13228/1.1/
Copyright: Phil Repicky
License: http://creativecommons.org/licenses/by/2.0/

Module: "Neural Network-based Accent Classification Results"
By: Scott Novich
URL: http://cnx.org/content/m13231/1.1/
Copyright: Scott Novich
License: http://creativecommons.org/licenses/by/2.0/

Module: "Conclusions and References"
By: Phil Repicky
URL: http://cnx.org/content/m13210/1.1/
Copyright: Phil Repicky
License: http://creativecommons.org/licenses/by/2.0/

Accent Classification using Neural Networks

Creating an algorithm for detecting and classifying accents using formant analysis and neural networks.

About Connexions

Since 1999, Connexions has been pioneering a global system where anyone can create course materials and make them fully accessible and easily reusable free of charge. We are a Web-based authoring, teaching and learning environment open to anyone interested in education, including students, teachers, professors and lifelong learners. We connect ideas and facilitate educational communities.

Connexions's modular, interactive courses are in use worldwide by universities, community colleges, K-12 schools, distance learners, and lifelong learners. Connexions materials are available in many languages, including English, Spanish, Chinese, Japanese, Italian, Vietnamese, French, Portuguese, and Thai. Connexions is part of an exciting new information distribution system that allows for Print on Demand Books. Connexions has partnered with innovative on-demand publisher QOOP to accelerate the delivery of printed course materials and textbooks into classrooms worldwide at lower prices than traditional academic publishers.