TEXT-PROMPTED REMOTE SPEAKER AUTHENTICATION - Project Report - GANESH TIWARI - IOE - TU


    TRIBHUVAN UNIVERSITY

    INSTITUTE OF ENGINEERING

    PULCHOWK CAMPUS

    TEXT-PROMPTED REMOTE SPEAKER AUTHENTICATION

    By:

    GANESH TIWARI (063/BCT/510)

    MADHAV PANDEY (063/BCT/514)

    MANOJ SHRESTHA (063/BCT/518)

    A PROJECT SUBMITTED TO THE DEPARTMENT OF ELECTRONICS

    AND COMPUTER ENGINEERING IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE BACHELOR'S DEGREE IN COMPUTER

    ENGINEERING

    DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

    LALITPUR, NEPAL

    January, 2011


    TRIBHUVAN UNIVERSITY

    INSTITUTE OF ENGINEERING

    PULCHOWK CAMPUS

    DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

    The undersigned certify that they have read, and recommended to the Institute of Engineering for

    acceptance, a project report entitled Text-Prompted Remote Speaker Authentication submitted

    by Ganesh Tiwari, Madhav Pandey and Manoj Shrestha in partial fulfillment of the requirements

    for the Bachelor's degree in Computer Engineering.

    __________________________________

    Supervisor, Dr. Subarna Shakya

    Associate Professor

    Department of Electronics and Computer Engineering

    __________________________________

    Internal Examiner,

    _________________________________

    External Examiner,

    DATE OF APPROVAL:


    COPYRIGHT

    The author has agreed that the Library, Department of Electronics and Computer Engineering,

    Pulchowk Campus, Institute of Engineering may make this report freely available for inspection.

    Moreover, the author has agreed that permission for extensive copying of this project report for

    scholarly purpose may be granted by the supervisors who supervised the project work recorded

    herein or, in their absence, by the Head of the Department wherein the project report was done. It

    is understood that due recognition will be given to the author of this report and to the Department

    of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering in any use

    of the material of this project report. Copying, publication, or any other use of this report for

    financial gain without the approval of the Department of Electronics and Computer Engineering,

    Pulchowk Campus, Institute of Engineering and the author's written permission is prohibited. Requests for permission to copy or to make any other use of the material in this report, in whole or

    in part, should be addressed to:

    Head

    Department of Electronics and Computer Engineering

    Pulchowk Campus, Institute of Engineering

    Lalitpur, Kathmandu

    Nepal


    ACKNOWLEDGEMENT

    We are very thankful to the Institute of Engineering (IOE), Pulchowk Campus for offering the

    major project course. We also thank all the teachers and staff of the Department of Electronics

    and Computer Engineering who assisted during the project period by giving suitable

    suggestions and lectures on different subject matters relating to the conduct and achievement

    of the project goals.

    We are very much obliged to Dr. Subarna Shakya, Department of Electronics and Computer

    Engineering, IOE Pulchowk Campus, for his inspiration and the valuable suggestions we received

    throughout the working period.

    We would like to thank the forum members of askmeflash.com, stackoverflow.com, and

    dsprelated.com for their quick responses and valuable opinions on our queries.

    We also express our gratitude to all the friends and juniors who helped a lot with the training data

    collection.

    Members of Project

    Ganesh Tiwari (063BCT510)

    Madhav Pandey (063BCT514)

    Manoj Shrestha (063BCT518)

    IOE, PULCHOWK CAMPUS


    ABSTRACT

    A biometric is a physical characteristic unique to each individual. It has very useful applications in

    authentication and access control.

    The designed system is a text-prompted voice biometric which incorporates a text-independent

    speaker verification system and a speaker-independent speech verification system,

    implemented independently. The foundation for this joint system is that the speech signal

    conveys both the speech content and the speaker identity. Such systems are more secure against

    playback attacks, since the word to be spoken during authentication is not set in advance.

    During the course of the project, various digital signal processing and pattern classification

    algorithms were studied. Short-time spectral analysis was performed to obtain MFCCs, energy,

    and their deltas as features. The feature extraction module is the same for both systems. Speaker

    modeling was done with GMMs, and a left-to-right discrete HMM with VQ was used for isolated

    word modeling. The results of both systems were combined to authenticate the user.

    The speech model for each word was pre-trained using utterances of 45 English words. The

    speaker model was trained on about 2 minutes of speech from each of 15 speakers. On individual

    words, the recognition rate of the speech recognition system is 92% and that of the speaker

    recognition system is 66%. For longer utterances (>5 s), the recognition rate of the

    speaker recognition system improves to 78%.


    TABLE OF CONTENTS

    PAGE OF APPROVAL

    COPYRIGHT

    ACKNOWLEDGEMENT

    ABSTRACT

    TABLE OF CONTENTS

    LIST OF FIGURES

    LIST OF SYMBOLS AND ABBREVIATIONS

    1. INTRODUCTION

    1.2 Objectives

    2. LITERATURE REVIEW

    2.1 Pattern Recognition

    2.2 Generation of Voice

    2.3 Voice as Biometric

    2.4 Speech Recognition

    2.5 Speaker Recognition

    2.5.1. Types of Speaker Recognition

    2.5.2. Modes of Speaker Recognition

    2.6 Feature Extraction for Speech/Speaker Recognition System

    2.6.1. Short Time Analysis

    2.6.2. MFCC Feature

    2.7 Speaker/Speech Modeling

    2.7.1. Gaussian Mixture Model

    2.7.2. Hidden Markov Model

    2.7.3. K-Means Clustering

    3. IMPLEMENTATION DETAILS

    3.1 Pre-Processing and Feature Extraction

    3.1.1. Capture

    3.1.2. End Point Detection and Silence Removal

    3.1.3. PCM Normalization

    3.1.4. Pre-emphasis

    3.1.5. Framing and Windowing

    3.1.6. Discrete Fourier Transform

    3.1.7. Mel Filter

    3.1.8. Cepstrum by Inverse Discrete Fourier Transform

    3.2 GMM Implementation

    3.2.1. Block Diagram of GMM Based Speaker Recognition System

    3.2.2. GMM Training

    3.2.3. Verification

    3.2.4. Performance Measure of Speaker Verification System

    3.3 Implementation of HMM for Speech Recognition

    3.3.1. Isolated Word Recognition

    3.3.2. Application of HMM

    3.3.3. Scaling

    4. UML CLASS DIAGRAMS OF THE SYSTEMS

    5. DATA COLLECTION AND TRAINING

    6. RESULTS

    7. APPLICATION AREA

    8. CONCLUSION

    REFERENCES

    APPENDIX A: BlazeDS Configuration for Remoting Service

    APPENDIX B: Words Used for HMM Training

    APPENDIX C: Development Tools and Environment

    APPENDIX D: Snapshots of Output GUI


    LIST OF FIGURES

    Figure 1.1: System Architecture

    Figure 1.2: Block Diagram of Text Prompted Speaker Verification System

    Figure 2.1: General block diagram of pattern recognition system

    Figure 2.2: Vocal Schematic

    Figure 2.3: Audio Sample for /i:/ phoneme

    Figure 2.4: Audio Magnitude Spectrum for /i:/ phoneme

    Figure 2.5: GMM with four Gaussian components and their equivalent model

    Figure 2.6: Ergodic Model of HMM

    Figure 2.7: Left to Right HMM

    Figure 3.1: Pre-Processing and Feature Extraction

    Figure 3.2: Input signal to End-point detection system

    Figure 3.3: Output signal from End point Detection System

    Figure 3.4: Signal before Pre-Emphasis

    Figure 3.5: Signal after Pre-Emphasis

    Figure 3.6: Frame Blocking of the Signal

    Figure 3.7: Hamming window

    Figure 3.8: A single frame before and after windowing

    Figure 3.9: Equally spaced Mel values

    Figure 3.10: Mel Scale Filter Bank

    Figure 3.11: Block diagram of GMM based Speaker Recognition System

    Figure 3.12: Equal Error Rate (EER)

    Figure 3.13: Speech Recognition algorithm flow

    Figure 3.14: Pronunciation model of word TOMATO

    Figure 3.15: Vector Quantization

    Figure 3.16: Average word error rate (for a digits vocabulary) versus the number of states N in the HMM

    Figure 3.17: Curve showing tradeoff of VQ average distortion as a function of the size of the VQ, M (shown on a log scale)

    Figure 3.18: Forward Procedure - Induction Step

    Figure 3.19: Backward Procedure - Induction Step

    Figure 3.20: Viterbi Search

    Figure 3.21: Computation of ξt(i, j)

    Figure 4.1: UML Diagram of Client System

    Figure 4.2: UML Diagram of Server System


    LIST OF SYMBOLS AND ABBREVIATIONS

    λ      GMM/HMM Model

    T      Threshold

    σ²     Variance

    Λ(X)   Likelihood Ratio

    μ      Mean

    π      Initial State Distribution

    A      State Transition Probability Distribution

    B      Observation Symbol Probability Distribution

    Cm     Covariance Matrix for mth Gaussian Component

    qt     State at Time t

    wm     Weighting Factor for mth Gaussian Component

    x      Feature Vector

    AIR Adobe Integrated Runtime

    DC Direct Current

    DCT Discrete Cosine Transform

    DFT Discrete Fourier Transform

    DHMM Discrete Hidden Markov Model

    DTW Dynamic Time Warping

    EM Expectation-Maximization

    FAR False Acceptance Rate

    FRR False Rejection Rate

    GMM Gaussian Mixture Model

    HMM Hidden Markov Model


    LPC Linear Prediction Coding

    MFCC Mel Frequency Cepstral Coefficient

    ML Maximum Likelihood

    PDF Probability Distribution Function

    PLP Perceptual Linear Prediction

    RIA Rich Internet Application

    RPC Remote Procedure Call

    SID Speaker IDentification

    TER Total Error Rate

    UBM Universal Background Model

    UML Unified Modeling Language

    VQ Vector Quantization

    WTP Web Tools Platform


    1. INTRODUCTION

    Biometrics is, in the simplest definition, something you are. It is a physical characteristic

    unique to each individual, such as a fingerprint, retina, iris, or speech. Biometrics has a very useful

    application in security; it can be used to authenticate a person's identity and control access to

    a restricted area, based on the premise that the set of these physical characteristics can be

    used to uniquely identify individuals.

    The speech signal conveys two important types of information: primarily the speech content

    and, on a secondary level, the speaker identity. Speech recognizers aim to extract the lexical

    information from the speech signal independently of the speaker by reducing the inter-

    speaker variability. Speaker recognition, on the other hand, is concerned with extracting the

    identity of the person speaking the utterance. So both speech recognition and speaker

    recognition are possible from the same voice input.

    Text Prompted Remote Speaker Authentication is a voice biometric system that authenticates

    a user, before permitting the user to log into a system, on the basis of the user's input voice. It

    is a web application. Voice signal acquisition and feature extraction are done on the client.

    The training and authentication tasks, based on the voice features obtained from the client side,

    are done on the server. The authentication task is based on a text-prompted version of speaker recognition,

    which incorporates both speaker recognition and speech recognition. This joint

    implementation of speech and speaker recognition includes text-independent speaker

    recognition and speaker-independent speech recognition. Speaker recognition verifies

    whether the speaker is the claimed one, while speech recognition verifies whether or not the

    spoken word matches the prompted word.

    The client side is realized in Adobe Flex whereas the server side is realized in Java. The

    communication between these two platforms is made possible with the help of BlazeDS's

    RPC RemoteObject service.


    Figure 1.1: System Architecture


    Mel Frequency Cepstral Coefficients (MFCC) are used as features for both the speech and speaker

    recognition tasks. We also combined energy features and the delta and delta-delta features of

    energy and MFCC.

    After calculating the features, a Gaussian Mixture Model (GMM) is used for speaker

    modeling, and a left-to-right Discrete Hidden Markov Model with Vector Quantization (DHMM/VQ) is used for speech modeling.

    Based on the speech model, the system decides whether or not the uttered speech matches

    what the user was prompted to utter. Similarly, based on the speaker model, the system decides

    whether or not the speaker is the claimed one. The speaker is then authenticated with the help of the

    combined result of these two tests.

    Referring to Figure 1.2, the feature extraction module is the same for both speech and speaker

    recognition, and these recognition systems are implemented independently of each other.

    Figure 1.2: Block Diagram of Text Prompted Speaker Verification System

    1.2 Objectives

    The objectives of this project are:

    To design and build a speaker verification system

    To design and build a speech verification system

    To implement these systems jointly to control remote access to a restricted area


    2. LITERATURE REVIEW

    2.1 Pattern Recognition

    Pattern recognition, one of the branches of artificial intelligence and a sub-field of machine

    learning, is the study of how machines can observe the environment, learn to distinguish patterns of interest from their background, and make sound and reasonable decisions about

    the categories of the patterns. A pattern can be a fingerprint image, a handwritten cursive

    word, a human face, a speech signal, a sales pattern, etc.

    The applications of pattern recognition include data mining, document classification,

    financial forecasting, organization and retrieval of multimedia databases, and biometrics

    (personal identification based on various physical attributes such as face, retina, speech, ear

    and fingerprints).

    The essential steps of pattern recognition are: Data Acquisition, Preprocessing, Feature

    Extraction, Training and Classification.

    Figure 2.1: General block diagram of pattern recognition system

    Features are used to describe a pattern. Features must be selected so that they are

    discriminative and invariant. They can be represented as a vector, matrix, tree, graph, or

    string. They are ideally similar for objects in the same class and very different for objects in

    different classes.

    Pattern class is a family of patterns that share some common properties. Pattern recognition

    by machine involves techniques for assigning patterns to their respective classes

    automatically and with as little human intervention as possible.

    Learning and Classification usually use one of the following approaches: Statistical Pattern

    Recognition is based on statistical characterizations of patterns, assuming that the patterns are


    generated by a probabilistic system. Syntactical (or Structural) Pattern Recognition is based

    on the structural interrelationships of features.

    Given a pattern, its recognition/classification may consist of one of the following two tasks

    according to the type of learning procedure: 1) Supervised Classification (e.g., Discriminant

    Analysis), in which the input pattern is identified as a member of a predefined class, and 2) Unsupervised Classification (e.g., clustering), in which the pattern is assigned to a previously

    unknown class.

    2.2 Generation of Voice

    Speech begins with the generation of an airstream, usually by the lungs and diaphragm, a

    process called initiation. This air then passes through the larynx, where it is modulated

    by the glottis (vocal cords). This step is called phonation or voicing, and is responsible for

    the generation of pitch and tone. Finally, the modulated air is filtered by the mouth, nose, and

    throat, a process called articulation, and the resultant pressure wave excites the air.

    Figure 2.2: Vocal Schematic

    Depending upon the positions of the various articulators, different sounds are produced.

    The position of the articulators can be modeled by a linear time-invariant system whose frequency

    response is characterized by several peaks called formants. The change in the frequencies of the

    formants characterizes the phoneme being articulated.


    As a consequence of this physiology, we can notice several characteristics of the frequency

    domain spectrum of speech. First of all, the oscillation of the glottis results in an underlying

    fundamental frequency and a series of harmonics at multiples of this fundamental.

    This is shown in the figure below, where we have plotted a brief audio waveform for the

    phoneme /i:/ and its magnitude spectrum. The fundamental frequency (180 Hz) and its harmonics appear as spikes in the spectrum. The location of the fundamental frequency is

    speaker dependent, and is a function of the dimensions and tension of the vocal cords. For

    adults it usually falls between 100 Hz and 250 Hz, and females average significantly higher

    than males.

    Figure 2.3: Audio Sample for /i:/ phoneme showing stationary property of phonemes for a short period

    The sound comes out in phonemes, which are the building blocks of speech. Each phoneme

    resonates at a fundamental frequency and its harmonics and thus has high energy at those

    frequencies; in other words, different phonemes have different formants. It is this feature that enables the

    identification of each phoneme at the recognition stage.

    The inter-speaker variations in the features of the speech signal during the utterance of a word are

    modeled in word training for speech recognition, while for speaker recognition the intra-

    speaker variations in the features over longer speech content are modeled.



    Figure 2.4: Audio Magnitude Spectrum for /i:/ phoneme showing fundamental frequency and its harmonics

    Besides the configuration of the articulators, the acoustic manifestation of a phoneme is affected by:

    Physiology and emotional state of speaker

    Phonetic context

    Accent

    2.3 Voice as Biometric

    The underlying premise for voice authentication is that each person's voice differs in pitch,

    tone, and volume enough to make it uniquely distinguishable. Several factors contribute to

    this uniqueness: size and shape of the mouth, throat, nose, and teeth (articulators) and the

    size, shape, and tension of the vocal cords. The chance that all of these are exactly the same

    in any two people is very low.

    Voice biometrics has the following advantages over other forms of biometrics:

    Natural signal to produce

    Low implementation cost, since it doesn't require a specialized input device

    Acceptable to users

    Easily combined with other forms of authentication for multifactor authentication

    The only biometric that allows users to authenticate remotely



    2.4 Speech recognition

    Speech is the dominant means for communication between humans, and promises to be

    important for communication between humans and machines, if it can just be made a little

    more reliable.

    Speech recognition is the process of converting an acoustic signal to a set of words. The

    applications include voice commands and control, data entry, voice user interfaces, automating

    the telephone operator's job in telephony, etc. The recognized words can also serve as the input to natural

    language processing.

    There are two variants of speech recognition based on the duration of the speech signal: isolated

    word recognition, in which each word is surrounded by some sort of pause, is much easier

    than recognizing continuous speech, in which words run into each other and have to be

    segmented.

    Speech recognition is a difficult task because of the many sources of variability associated

    with the signal. First, the acoustic realizations of phonemes, the smallest sound units of

    which words are composed, are highly dependent on context. Second, acoustic variability can

    result from changes in the environment as well as in the position and characteristics of the

    transducer. Third, within-speaker variability can result from changes in the speaker's physical

    and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic

    background, dialect, and vocal tract size and shape contribute to cross-speaker variability.

    Such variability is modeled in various ways. At the level of signal representation,

    representations that emphasize speaker-independent features are developed.

    2.5 Speaker Recognition

    Speaker recognition is the process of automatically recognizing who is speaking on the basis

    of individual information included in the speech waves. Speaker recognition can be classified

    into identification and verification. Speaker recognition has been applied most often as a means

    of biometric authentication.


    2.5.1. Types of Speaker Recognition

    2.5.1.1 Speaker Identification

    Speaker identification is the process of determining which registered speaker provides a

    given utterance. In a Speaker IDentification (SID) system, no identity claim is provided; the

    test utterance is scored against a set of known (registered) references for each potential

    speaker, and the one whose model best matches the test utterance is selected.

    There are two types of speaker identification tasks: closed-set and open-set speaker

    identification.

    In closed-set identification, the test utterance belongs to one of the registered speakers. During testing, a

    matching score is estimated for each registered speaker, and the speaker corresponding to the

    model with the best matching score is selected. This requires N comparisons for a population of N speakers.

    In open-set identification, any speaker can attempt to access the system; those who are not registered should be

    rejected. This requires another model, referred to as a garbage model, imposter model, or

    background model, which is trained with data provided by speakers other than the

    registered speakers. During testing, the matching score corresponding to the best speaker

    model is compared with the matching score estimated using the garbage model in order to

    accept or reject the speaker, making the total number of comparisons equal to N + 1. Speaker

    identification performance tends to decrease as the population size increases.

    2.5.1.2 Speaker Verification

    Speaker verification, on the other hand, is the process of accepting or rejecting the identity

    claim of a speaker. That is, the goal is to automatically accept or reject an identity that is

    claimed by the speaker. During testing, a verification score is estimated using the claimed

    speaker model and the anti-speaker model. This verification score is then compared to a

    threshold. If the score is higher than the threshold, the speaker is accepted; otherwise, the speaker is rejected. Thus, speaker verification involves a hypothesis test requiring a simple

    binary decision: accept or reject the claimed identity regardless of the population size. Hence,

    the performance is quite independent of the population size, but it depends on the number of

    test utterances used to evaluate the performance of the system.
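    As a hedged illustration of this decision rule (not code from the report), the accept/reject step can be written as a comparison of a log-likelihood ratio against a threshold; the model classes, the use of log-likelihoods, and the threshold value are assumptions made for the sketch.

    /** Minimal sketch of the verification decision described above; illustrative only. */
    final class VerificationDecision {

        /**
         * Accepts the claim if the log-likelihood ratio between the claimed-speaker
         * model and the background (anti-speaker) model exceeds a threshold.
         *
         * @param logLikClaimed    log p(X | claimed speaker model)
         * @param logLikBackground log p(X | background / anti-speaker model)
         * @param threshold        decision threshold (tuned, e.g., near the equal error rate)
         */
        static boolean accept(double logLikClaimed, double logLikBackground, double threshold) {
            double llr = logLikClaimed - logLikBackground;  // log-likelihood ratio score
            return llr > threshold;                         // accept iff the score exceeds the threshold
        }
    }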


    2.5.2. Modes of Speaker Recognition

    There are 3 modes in which speaker verification/identification can be done.

    2.5.2.1 Text Independent

    In text independent mode, the system relies only on the voice characteristics of the speaker;

    the lexical content of the utterance is not used. The system models the characteristics of the

    speech which show up irrespective of what is being said. This mode is used in surveillance

    or forensic applications where there is no control over the speakers accessing the system. The

    test utterances can be different from those used for enrollment; hence, text-independent

    speaker verification needs a large and rich training data set to model the characteristics of the

    speaker's voice and to cover the phonetic space.

    A larger training set and longer test segments are required than for the text-dependent mode, to appropriately model the feature variations of the current user in uttering different phonemes.

    2.5.2.2 Text Dependent

    In the text dependent mode of verification, the user is expected to say a pre-determined text -

    a voice password. Since recognition is based on the speaker characteristics as well as the

    lexical content of the password, text dependent speaker recognition systems are generally

    more robust and achieve good performance. However, such systems are not yet used on a large

    scale due to the risk of playback attacks, since the system has a priori knowledge about the

    password, i.e., the training and the test texts are the same. The speaker model encodes the

    speaker's voice characteristics associated with the phonemic or syllabic content of the

    password.

    2.5.2.3 Text-prompted

    Both text-dependent and text-independent systems are susceptible to fraud, since for typical

    applications the voice of a speaker could be captured, recorded, and reproduced. To limit this

    risk, a particular kind of text-dependent speaker verification system based on prompted text

    has been developed. The password, i.e., the text to speak, is not pre-determined; rather, the user

    is asked to speak a prompted text (digits, a word, or a phrase). If the number of distinct random

    passwords is large, the playback attack is not feasible. Hence the text prompted system is

    more secure.


    As in the case of text-independent systems, text-prompted systems also need a large and

    rich training data set for each registered speaker to create robust speaker-dependent models.

    For these reasons, we have chosen the text-prompted system.

    2.6 Feature Extraction for speech/speaker recognition system

    Signal representation, or coding of the short-term spectrum into feature vectors, is one of the

    most important steps in automatic speaker recognition and continues to be a subject of

    research. Many different techniques have been proposed in the literature, and generally they

    are based on speech production models or speech perception models.

    The goal of feature extraction is to transform the input waveform into a sequence of acoustic

    feature vectors, each vector representing the information in a small time window of the

    signal. Feature extraction transforms the high-dimensional input signal into lower dimensional

    vectors. For speaker recognition purposes, an optimal feature has the following properties:

    1. High inter-speaker variation,

    2. Low intra-speaker variation,

    3. Easy to measure,

    4. Robust against disguise and mimicry,

    5. Robust against distortion and noise,

    6. Maximally independent of the other features.

    2.6.1. Short time analysis

    The analysis of the speech signal at the spectral level is based on classic Fourier analysis.

    However, the Fourier transform cannot be directly applied to the whole signal because the

    speech signal cannot be considered stationary, due to constant changes in the

    articulatory system within each speech utterance. To solve this problem, the speech signal is

    split into a sequence of short segments in such a way that each one is short enough to be

    considered pseudo-stationary. The length of each segment, also called a window or frame,

    ranges between 10 and 40 ms (in such a short time period our articulatory system is not able

    to change significantly). Finally, a feature vector is extracted from the short-time

    spectrum in each window. The whole process is known as short-term spectral analysis.


    2.6.2. MFCC Feature

    The commonly used feature extraction methods for speech/speaker recognition are LPC (Linear

    Prediction Coding), MFCC (Mel Frequency Cepstral Coefficients), and PLP (Perceptual

    Linear Prediction). LPC is based on the assumption that a speech sample can be approximated by

    a linearly weighted summation of a determined number of preceding samples. PLP is calculated in a similar way to LPC coefficients, but transformations are first carried out

    on the spectrum of each window, aiming at incorporating human hearing behavior.

    The most popular feature extraction method, MFCC, mimics human hearing behavior by

    emphasizing lower frequencies and de-emphasizing higher frequencies.

    The Mel scale, proposed by Stevens, Volkmann and Newman in 1937, is a perceptual scale of

    pitches judged by listeners to be equal in distance from one another.

    The Mel scale is based on an empirical study of the human perceived pitch or frequency.

    Human hearing, however, is not equally sensitive at all frequency bands. It is less sensitive at

    higher frequencies, roughly above 1000 Hertz. It is a unit of pitch defined so that pairs of

    sounds which are perceptually equidistant in pitch are separated by an equal number of Mels.

    The mapping between frequency in Hertz and the Mel scale is linear below 1000 Hz and

    logarithmic above 1000 Hz:

    mel(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)

    Modeling this property of human hearing during feature extraction improves speech

    recognition performance. The form of the model used in MFCCs is to warp the frequencies

    output by the DFT onto the Mel scale. During MFCC computation, this insight is

    implemented by creating a bank of filters which collect energy from each frequency band.
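    For reference, the warping and its inverse might be written in code as follows; this is a small illustrative sketch, not taken from the project.

    /** Hz <-> mel conversions for the warping above. Illustrative sketch. */
    final class MelScale {
        static double hzToMel(double hz)  { return 2595.0 * Math.log10(1.0 + hz / 700.0); }
        static double melToHz(double mel) { return 700.0 * (Math.pow(10.0, mel / 2595.0) - 1.0); }
    }

    For example, 1000 Hz maps to approximately 1000 mel under this formula.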


    2.7 Speaker/Speech Modeling

    2.7.1. Gaussian Mixture Model

    Gaussian Mixture Models have been used extensively in speaker recognition systems due to their capability of representing a large class of sample distributions.

    Like K-Means, Gaussian Mixture Models (GMM) can be regarded as a type of unsupervised

    learning or clustering methods. GMM is based on clustering technique, where the entire set

    of experimental data set is modeled by a mixture of Gaussians. But unlike K-Means, GMMs

    are able to build soft clustering boundaries, i.e., points in space can belong to any class with a

    given probability.

    In a Gaussian mixture distribution, its density function is just a convex combination (a linear

    combination in which all coefficients or weights sum to one) of Gaussian probability density

    functions:

    Figure 2.5: GMM with four Gaussian components and their equivalent model

    Mathematically, a GMM is a weighted sum of M Gaussian component densities, given by the equation

    p(\mathbf{x} \mid \lambda) = \sum_{m=1}^{M} w_m \, g(\mathbf{x} \mid \mu_m, C_m)

    where \mathbf{x} is a k-dimensional random vector, the w_m are the mixture weights that show the relative importance of each component and satisfy the constraint \sum_{m=1}^{M} w_m = 1, and g(\mathbf{x} \mid \mu_m, C_m), m = 1, 2, \ldots, M, are the component densities, each of which is a k-dimensional Gaussian function (pdf) of the form

    g(\mathbf{x} \mid \mu_m, C_m) = \frac{1}{(2\pi)^{k/2} \, |C_m|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \mu_m)^{T} C_m^{-1} (\mathbf{x} - \mu_m) \right\}

    where \mu_m is the mean vector of length k of the mth Gaussian pdf and C_m is the k \times k covariance matrix of the mth Gaussian pdf.

    Thus the complete Gaussian Mixture Model is parameterized by the mixture weights, mean vectors, and covariance matrices of all component densities. These parameters are collectively represented by the notation

    \lambda = \{ w_m, \mu_m, C_m \}, \quad m = 1, 2, \ldots, M

    These parameters are estimated during training. For the speaker recognition system, each

    speaker is represented by a GMM and is referred to by his/her model \lambda.
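    To make the model concrete, the sketch below evaluates the GMM density and an average log-likelihood score for a sequence of feature vectors. Diagonal covariance matrices are assumed purely for illustration; the report does not specify this code.

    /** Illustrative GMM density evaluation with diagonal covariances; not the report's code. */
    final class DiagonalGmm {
        private final double[] weights;   // w_m, summing to 1
        private final double[][] means;   // mu_m, one row per component
        private final double[][] vars;    // diagonal of C_m, one row per component

        DiagonalGmm(double[] weights, double[][] means, double[][] vars) {
            this.weights = weights;
            this.means = means;
            this.vars = vars;
        }

        /** p(x | lambda) = sum_m w_m * g(x | mu_m, C_m) with diagonal C_m. */
        double density(double[] x) {
            double p = 0.0;
            for (int m = 0; m < weights.length; m++) {
                double logG = -0.5 * x.length * Math.log(2.0 * Math.PI);
                for (int d = 0; d < x.length; d++) {
                    double diff = x[d] - means[m][d];
                    logG += -0.5 * Math.log(vars[m][d]) - 0.5 * diff * diff / vars[m][d];
                }
                p += weights[m] * Math.exp(logG);
            }
            return p;
        }

        /** Average log-likelihood of a sequence of feature vectors, usable as a matching score. */
        double averageLogLikelihood(double[][] frames) {
            double sum = 0.0;
            for (double[] frame : frames) {
                sum += Math.log(density(frame));
            }
            return sum / frames.length;
        }
    }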

    GMM is widely used in speaker modeling and classification due to its two important benefits:

    First, the individual Gaussian components in a speaker-dependent GMM are interpreted to

    represent some broad acoustic classes such as speaker-dependent vocal tract configurations

    that are useful for modeling speaker identity. A speaker voice can be characterized by a set of

    acoustic classes representing some broad phonetic events such as vowels, nasals, fricatives.

    These acoustic classes reflect some general speaker-dependent vocal tract configurations that

    are useful for characterizing speaker identity. The spectral shape of the i th acoustic class can

    in turn be represented by mean of the ith component density and variations of the average

    spectral shape can be represented by the covariance matrix. These acoustic classes are hidden

    before training. Second, a Gaussian mixture density provides a smooth approximation to the

    long-term sample distribution of training utterances by a given speaker. The unimodal

    Gaussian speaker model represents a speaker's feature distribution by a mean vector and

    covariance matrix and the VQ model represents a speakers distribution by a discrete set of

    characteristic templates. GMM acts as a hybrid between these two models using a discrete set

    of Gaussian functions, each with their own mean and covariance matrix to allow better

    modeling capability.


    2.7.2. Hidden Markov Model

    In general, a Markov model is a way of describing a process that goes through a series of

    states. The model describes all the possible paths through the state space and assigns a

    probability to each one. The probability of transitioning from the current state to another one

    depends only on the current state, not on any prior part of the path.

    HMMs can be applied in many fields where the goal is to recover a data sequence that is not

    immediately observable. Common applications include: Cryptanalysis, Speech recognition,

    Part-of-speech tagging, Machine translation, Partial discharge, Gene prediction, Alignment of

    bio-sequences, Activity recognition.

    2.7.2.1 Discrete Markov Processes

    The transition probability a_{ij} of a first-order Markov chain with N distinct states S_1, S_2, \ldots, S_N is given by:

    a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i], \quad 1 \le i, j \le N

    where q_t is the state at time t.

    The state transition coefficients have the following properties (due to standard stochastic constraints):

    a_{ij} \ge 0, \qquad \sum_{j=1}^{N} a_{ij} = 1

    The transition probabilities for all states in a model can be described by an N \times N transition probability matrix A = \{a_{ij}\}.

    The initial state distribution vector is given by:

    \pi = \big[\, P(q_1 = S_1), \; P(q_1 = S_2), \; \ldots, \; P(q_1 = S_N) \,\big]

    The stochastic property for the initial state distribution vector is:

    \sum_{i=1}^{N} \pi_i = 1, \qquad \text{where } \pi_i = P(q_1 = S_i), \; 1 \le i \le N

    The Markov model can then be described by

    \lambda = (A, \pi)

    This stochastic process could be called an observable Markov model, since the output of the

    process is the set of states at each instant of time, where each state corresponds to a physical

    (observable) event.

    2.7.2.3 Hidden Markov Model

    The Markov model is too restrictive to be applicable to many problems of interest. So the concept

    of the Markov model is extended to the Hidden Markov model to include the case where the

    observation is a probabilistic function of the state. The resulting model is a doubly embedded

    stochastic process with an underlying stochastic process that is not observable (i.e., hidden),

    but can only be observed through another set of stochastic processes that produce the

    sequence of observations. The difference is that in a Markov chain the output state is

    completely determined at each time t. In the Hidden Markov Model the state at each time t

    must be inferred from observations. An observation is a probabilistic function of a state.

    Elements of HMM

    The HMM is characterized by the following:

    1) Set of hidden states

    S = \{S_1, S_2, \ldots, S_N\}, with the state at time t denoted q_t \in S

    2) Set of observation symbols per state

    V = \{v_1, v_2, \ldots, v_M\}, with the observation at time t denoted O_t \in V

    3) The initial state distribution

    \pi = \{\pi_i\}, \quad \pi_i = P[q_1 = S_i], \quad 1 \le i \le N

    4) State transition probability distribution

    A = \{a_{ij}\}, \quad a_{ij} = P[q_{t+1} = S_j \mid q_t = S_i], \quad 1 \le i, j \le N

    5) Observation symbol probability distribution in state j

    B = \{b_j(k)\}, \quad b_j(k) = P[v_k \text{ at } t \mid q_t = S_j], \quad 1 \le j \le N, \; 1 \le k \le M

    Normally, an HMM is written compactly as \lambda = (A, B, \pi).
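    A minimal sketch of how the parameter set λ = (A, B, π) might be held in code; this is an illustrative assumption, not the project's implementation.

    /** Illustrative container for discrete-HMM parameters lambda = (A, B, pi); not the report's code. */
    final class DiscreteHmm {
        final double[] pi;     // initial state distribution, pi[i] = P(q1 = S_i)
        final double[][] a;    // transition probabilities, a[i][j] = P(q_{t+1} = S_j | q_t = S_i)
        final double[][] b;    // emission probabilities, b[j][k] = P(O_t = v_k | q_t = S_j)

        DiscreteHmm(double[] pi, double[][] a, double[][] b) {
            this.pi = pi;
            this.a = a;
            this.b = b;
        }

        int numStates()  { return pi.length; }
        int numSymbols() { return b[0].length; }
    }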

    2.7.2.4 Types of HMMs

    An ergodic, or fully connected, HMM has the property that every state can be reached from

    every other state in a finite number of steps. This type of model has the property that every

    a_{ij} coefficient is positive. For some applications, other types of HMMs have been found to account for observed properties of the signal being modeled better than the standard ergodic

    model.

    Figure 2.6: Ergodic Model of HMM

    One such model is the left-right model, or Bakis model, because the underlying state sequence

    associated with the model has the property that as time increases the state index increases (or

    stays the same), i.e., the states proceed from left to right. Clearly, the left-right type of HMM

    has the desirable property that it can readily model signals whose properties change over time,

    e.g., speech.


    Figure 2.7: Left to Right HMM

    The properties of left-right HMMs are:

    1) The state transition coefficients have the property

    a_{ij} = 0, \quad j < i

    i.e., no transition is allowed to states whose indices are lower than that of the current state.

    2) The initial state probabilities have the property

    \pi_1 = 1, \qquad \pi_i = 0 \;\; \text{for} \;\; i \ne 1

    since the state sequence must begin in state 1 (and end in state N).

    3) The state transition coefficients for the last state in a left-right model are specified as

    a_{NN} = 1, \qquad a_{Nj} = 0 \;\; \text{for} \;\; j < N

    With left-right models, additional constraints are placed on the state transition coefficients to

    make sure that large changes in state indices do not occur; hence a constraint of the form

    a_{ij} = 0, \quad j > i + \Delta

    is often used. The value of \Delta is 2 in this speech recognition system, i.e., no jumps of more

    than 2 states are allowed. The form of the state transition matrix for \Delta = 2 and N = 4 is shown below.
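    The matrix below is reconstructed from the stated constraints (a_{ij} = 0 for j < i and for j > i + 2); it stands in for the matrix shown in the original report.

    A =
    \begin{bmatrix}
    a_{11} & a_{12} & a_{13} & 0      \\
    0      & a_{22} & a_{23} & a_{24} \\
    0      & 0      & a_{33} & a_{34} \\
    0      & 0      & 0      & a_{44}
    \end{bmatrix}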


    2.7.3. K-Means Clustering

    Clustering can be considered the most important unsupervised learning problem; like every

    other problem of this kind, it deals with finding a structure in a collection of unlabeled data.

    A loose definition of clustering could be the process of organizing objects into groups whose

    members are similar in some way. A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.

    In statistics and machine learning, k-means clustering is a method of cluster analysis which

    aims to partition n observations into k clusters in which each observation belongs to the

    cluster with the nearest mean.

    The algorithm is composed of the following steps:

    1. Place K points into the space represented by the objects that are being clustered.

    These points represent initial group centroids.

    2. Assign each object to the group that has the closest centroid.

    3. When all objects have been assigned, recalculate the positions of the K centroids.

    4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation

    of the objects into groups from which the metric to be minimized can be calculated.

    Both the clustering process and the decoding process require a distance metric or distortion

    metric that specifies how similar two acoustic feature vectors are. The distance metric is used

    to build clusters, to find a prototype vector for each cluster, and to compare incoming vectors

    to the prototypes. The simplest distance metric for acoustic feature vectors is Euclidean

    distance, the distance in N-dimensional space between the two points

    defined by the two vectors.
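    A compact sketch of the k-means steps above with the Euclidean metric; it is illustrative only, and the array layout, random initialization, and convergence test are assumptions.

    import java.util.Random;

    /** Illustrative k-means clustering of acoustic feature vectors; a sketch, not the report's code. */
    final class KMeansSketch {

        /** Returns k centroids found by the assign/recompute iteration described above. */
        static double[][] cluster(double[][] data, int k, int maxIter, long seed) {
            Random rnd = new Random(seed);
            double[][] centroids = new double[k][];
            for (int c = 0; c < k; c++) {
                centroids[c] = data[rnd.nextInt(data.length)].clone();   // step 1: initial centroids
            }
            int[] assignment = new int[data.length];
            for (int iter = 0; iter < maxIter; iter++) {
                boolean changed = false;
                for (int n = 0; n < data.length; n++) {                  // step 2: assign to nearest centroid
                    int best = 0;
                    double bestDist = Double.MAX_VALUE;
                    for (int c = 0; c < k; c++) {
                        double d = squaredEuclidean(data[n], centroids[c]);
                        if (d < bestDist) { bestDist = d; best = c; }
                    }
                    if (assignment[n] != best) { assignment[n] = best; changed = true; }
                }
                for (int c = 0; c < k; c++) {                            // step 3: recompute centroids
                    double[] sum = new double[data[0].length];
                    int count = 0;
                    for (int n = 0; n < data.length; n++) {
                        if (assignment[n] == c) {
                            for (int d = 0; d < sum.length; d++) sum[d] += data[n][d];
                            count++;
                        }
                    }
                    if (count > 0) {
                        for (int d = 0; d < sum.length; d++) sum[d] /= count;
                        centroids[c] = sum;
                    }
                }
                if (!changed) break;                                     // step 4: stop when assignments settle
            }
            return centroids;
        }

        private static double squaredEuclidean(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) { double diff = a[i] - b[i]; s += diff * diff; }
            return s;
        }
    }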


    3. IMPLEMENTATION DETAILS

    The implementation of the joint speaker/speech recognition system includes a common pre-

    processing and feature extraction module, text-independent speaker modeling and

    classification by GMM, and speaker-independent speech modeling and classification by

    HMM/VQ.

    3.1 Pre-Processing and Feature Extraction

    Starting from the capturing of the audio signal, feature extraction consists of the following steps

    as shown in the block diagram below:

    (Figure 3.1 shows the pipeline: speech signal → silence removal → pre-emphasis → framing → windowing → DFT → Mel filter bank → log → IDFT → CMS, producing 12 MFCC coefficients and 1 energy feature per frame, plus their delta features.)

    Figure 3.1: Pre-Processing and Feature Extraction

    3.1.1. Capture

    The first step in processing speech is to convert the analog representation (first air pressure,

    and then analog electric signals in a microphone) into a digital signal x[n], where n is an

    index over time. Analysis of the audio spectrum shows that nearly all energy resides in the

    band between DC and 4 kHz, and beyond 10 kHz there is virtually no energy whatsoever.

    The sound format used: 22050 Hz sampling rate, 16-bit signed samples, little-endian byte order, mono channel, uncompressed PCM.


    3.1.2. End point detection and Silence removal

    The captured audio signal may contain silence at different positions, such as the beginning of the

    signal, in between the words of a sentence, and the end of the signal. If silent frames are included,

    modeling resources are spent on parts of the signal which do not contribute to the

    identification. The silence present must therefore be removed before further processing.

    There are several ways of doing this; the most popular are Short-Time Energy and Zero

    Crossing Rate, but they have their own limitations because thresholds must be set on an ad hoc

    basis. The algorithm we used [Ref. 4] uses statistical properties of the background noise as well as

    physiological aspects of speech production, and does not assume any ad hoc threshold. It

    assumes that the background noise present in the utterances is Gaussian in nature.

    Usually the first 200 ms or more (we used 4410 samples at the sampling rate of 22050

    samples/sec) of a speech recording corresponds to silence (or background noise), because the

    speaker takes some time to start reading when the recording starts.

    Endpoint Detection Algorithm

    Step 1: Calculate the mean (μ) and standard deviation (σ) of the first 200 ms of samples of the

    given utterance. The background noise is characterized by this μ and σ.

    Step 2: Go from the first sample to the last sample of the speech recording. For each sample, check

    whether the one-dimensional Mahalanobis distance, i.e., |x − μ|/σ, is greater than 3 or not. If the

    Mahalanobis distance is greater than 3, the sample is treated as a voiced sample;

    otherwise it is unvoiced/silence.

    This threshold rejects up to 99.7% of the background-noise samples, as given by P[|x − μ| ≤ 3σ] = 0.997 for a

    Gaussian distribution, thus accepting only the voiced samples.

    Step 3: Mark each voiced sample as 1 and each unvoiced sample as 0. Divide the whole speech

    signal into 10 ms non-overlapping windows. Represent the complete speech by only zeros

    and ones.

    Step 4: Suppose there are M zeros and N ones in a window. If M ≥ N,

    then convert each of the ones to zeros, and vice versa. This method is adopted keeping in mind

    that the speech production system, consisting of the vocal cords, tongue, vocal tract, etc., cannot

    change abruptly within a short time window, taken here as 10 ms.

    Step 5: Collect the voiced part only, according to the samples labeled 1, from the windowed

    array and dump it into a new array. Retrieve the voiced part of the original speech signal from the

    samples labeled 1.
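    A condensed sketch of this endpoint detection algorithm; it is illustrative, and the handling of the first 200 ms, the 10 ms window smoothing, and the array conventions are assumptions rather than the project's exact code.

    /** Illustrative endpoint detection following the steps above; not the report's exact code. */
    final class EndpointDetector {

        /** Returns only the voiced samples of the input signal. */
        static double[] removeSilence(double[] x, int sampleRate) {
            int noiseLen = Math.min(sampleRate / 5, x.length);       // step 1: first ~200 ms as background noise
            double mean = 0.0;
            for (int i = 0; i < noiseLen; i++) mean += x[i];
            mean /= noiseLen;
            double var = 0.0;
            for (int i = 0; i < noiseLen; i++) var += (x[i] - mean) * (x[i] - mean);
            double std = Math.sqrt(var / noiseLen);

            boolean[] voiced = new boolean[x.length];
            for (int i = 0; i < x.length; i++) {                      // step 2: Mahalanobis distance > 3
                voiced[i] = Math.abs(x[i] - mean) / std > 3.0;
            }

            int win = sampleRate / 100;                               // steps 3-4: 10 ms majority smoothing
            for (int start = 0; start < x.length; start += win) {
                int end = Math.min(start + win, x.length);
                int ones = 0;
                for (int i = start; i < end; i++) if (voiced[i]) ones++;
                boolean label = ones > (end - start) - ones;          // majority vote within the window
                for (int i = start; i < end; i++) voiced[i] = label;
            }

            int count = 0;                                            // step 5: keep voiced samples only
            for (boolean v : voiced) if (v) count++;
            double[] out = new double[count];
            int j = 0;
            for (int i = 0; i < x.length; i++) if (voiced[i]) out[j++] = x[i];
            return out;
        }
    }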

    Figure 3.2: Input signal to End-point detection system

    Figure 3.3: Output signal from End point Detection System

    3.1.3. PCM Normalization

    The extracted pulse-code-modulated amplitude values are normalized to avoid amplitude

    variation during capturing.

    3.1.4. Pre-emphasis

    Usually the speech signal is pre-emphasized before any further processing. If we look at the

    spectrum of voiced segments like vowels, there is more energy at the lower frequencies than at the

    higher frequencies. This drop in energy across frequencies is caused by the nature of the

    glottal pulse. Boosting the high frequency energy makes information from these higher

    formants more available to the acoustic model and improves phone detection accuracy.

    The pre-emphasis filter is a first-order high-pass filter. In the time domain, with input x[n]

    and 0.9 \le \alpha \le 1.0, the filter equation is:

    y[n] = x[n] - \alpha \, x[n-1]

    We used \alpha = 0.95.
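    The filter can be written in a few lines of code (illustrative sketch):

    /** Applies y[n] = x[n] - alpha * x[n-1]; alpha = 0.95 as used above. Illustrative sketch. */
    final class PreEmphasis {
        static double[] apply(double[] x, double alpha) {
            double[] y = new double[x.length];
            y[0] = x[0];                             // no previous sample for n = 0
            for (int n = 1; n < x.length; n++) {
                y[n] = x[n] - alpha * x[n - 1];
            }
            return y;
        }
    }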



    Figure 3.4: Signal before Pre-Emphasis

    Figure 3.5: Signal after Pre-Emphasis

    3.1.5. Framing and windowing

    Speech is a non-stationary signal, meaning that its statistical properties are not constant

    across time. Instead, we want to extract spectral features from a small window of speech that

    characterizes a particular subphone and for which we can make the (rough) assumption that

    the signal is stationary (i.e. its statistical properties are constant within this region).

    We used frame blocks of 23.22 ms with 50% overlap, i.e., 512 samples per frame.

    Figure 3.6: Frame Blocking of the Signal



    The rectangular window (i.e., no window) can cause problems when we do Fourier analysis;

    it abruptly cuts off the signal at its boundaries. A good window function has a narrow main

    lobe and low side-lobe levels in its transfer function, which shrinks the values of the signal

    toward zero at the window boundaries, avoiding discontinuities. The most commonly used

    window function in speech processing is the Hamming window, defined as follows:

    w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1

    Figure 3.7: Hamming window

    The extraction of the windowed signal takes place by multiplying the value of the signal at time n,

    s_{frame}[n], with the value of the window at time n, s_w[n]:

    y[n] = s_w[n] \cdot s_{frame}[n]

    Figure 3.8: A single frame before and after windowing
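    A rough sketch of framing and windowing with the parameters stated above (512-sample frames, 50% overlap, Hamming window); this is illustrative, not the report's code.

    /** Splits a signal into 50%-overlapping frames and applies a Hamming window. Illustrative sketch. */
    final class Framing {
        static double[][] frameAndWindow(double[] x, int frameSize) {
            int hop = frameSize / 2;                                     // 50% overlap
            int numFrames = Math.max(0, 1 + (x.length - frameSize) / hop);
            double[][] frames = new double[numFrames][frameSize];
            for (int f = 0; f < numFrames; f++) {
                int offset = f * hop;
                for (int n = 0; n < frameSize; n++) {
                    double w = 0.54 - 0.46 * Math.cos(2.0 * Math.PI * n / (frameSize - 1)); // Hamming window
                    frames[f][n] = w * x[offset + n];
                }
            }
            return frames;
        }
    }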



    3.1.6. Discrete Fourier Transform

    A Discrete Fourier Transform (DFT) of the windowed signal is used to extract the frequency

    content (the spectrum) of the current frame. The tool for extracting spectral information i.e.,

    how much energy the signal contains at discrete frequency bands for a discrete-time

    (sampled) signal is the Discrete Fourier Transform, or DFT. The input to the DFT is a windowed signal x[n]...x[m], and the output, for each of N discrete frequency bands, is a complex number X[k] representing the magnitude and phase of that frequency component in the original signal:

    X[k] = Σ_{n=0}^{N-1} x[n]·e^(-j2πkn/N),   k = 0, 1, 2, …, N - 1

    The commonly used algorithm for computing the DFT is the Fast Fourier Transform, or FFT for short.
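    In practice an FFT routine is used; the naive O(N²) sketch below only illustrates the definition of the DFT and the magnitude spectrum that the later stages consume (names are illustrative, not from the project code):

        /** Naive DFT of one windowed frame; returns the magnitude spectrum |X[k]|. */
        public final class Dft {
            public static double[] magnitude(double[] frame) {
                int N = frame.length;
                double[] mag = new double[N];
                for (int k = 0; k < N; k++) {
                    double re = 0.0, im = 0.0;
                    for (int n = 0; n < N; n++) {
                        double angle = -2.0 * Math.PI * k * n / N;
                        re += frame[n] * Math.cos(angle);      // real part of X[k]
                        im += frame[n] * Math.sin(angle);      // imaginary part of X[k]
                    }
                    mag[k] = Math.sqrt(re * re + im * im);
                }
                return mag;
            }
        }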

    3.1.7. Mel Filter

    For calculating the MFCC, first, a transformation is applied according to the following formula:

    mel(x) = 2595·log10( 1 + x / 700 )

    where x is the linear frequency.

    Then, a filter bank is applied to the amplitude of the mel-scaled spectrum.

    The Mel frequency warping is most conveniently done by utilizing a filter bank with filters

    centered according to Mel frequencies. The width of the triangular filters varies according to

    the Mel scale, so that the log total energy in a critical band around the center frequency is

    included. The centers of the filters are uniformly spaced in the mel scale.


    Figure 3.9: Equally spaced Mel values

    The output of the Mel filter bank describes the distribution of energy in each Mel-scale band; one output value is obtained per filter, giving a vector of filter-bank energies for each frame.

    Figure 3.10: Triangular filter bank in frequency scale

    We have used 30 filters in the filter bank.


    The boundary frequencies of the triangular filters are uniformly spaced on the mel scale and can be computed from the raw acoustic frequency as follows:

    f(i) = mel⁻¹( mel(f_low) + i·( mel(f_high) - mel(f_low) ) / (M + 1) ),   i = 1, 2, …, M

    where

    f_low = 0 and f_high = Fs/2 (half the sampling frequency),

    mel(f) = 2595·log10( 1 + f / 700 )

    mel⁻¹(m) = 700·( 10^(m/2595) - 1 )
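    A sketch of how the filter positions can be derived is given below. It assumes f_low = 0 and f_high = Fs/2, and it uses a common convention for mapping the boundary frequencies to DFT bin indices; the exact mapping in the project code may differ, and the names are illustrative.

        /** Mel-scale helpers and filter-bank boundary points spaced uniformly in mel. */
        public final class MelFilterBank {
            static double hzToMel(double f) { return 2595.0 * Math.log10(1.0 + f / 700.0); }
            static double melToHz(double m) { return 700.0 * (Math.pow(10.0, m / 2595.0) - 1.0); }

            /** DFT bin indices of the numFilters + 2 boundary points of the triangular filters. */
            public static int[] boundaryBins(int numFilters, int fftSize, double sampleRate) {
                double melLow = hzToMel(0.0);
                double melHigh = hzToMel(sampleRate / 2.0);
                int[] bins = new int[numFilters + 2];
                for (int i = 0; i < bins.length; i++) {
                    double mel = melLow + i * (melHigh - melLow) / (numFilters + 1);
                    bins[i] = (int) Math.floor((fftSize + 1) * melToHz(mel) / sampleRate);
                }
                return bins;
            }
        }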

    3.1.8. Cepstrum by Inverse Discrete Fourier Transform

    A cepstral transform is applied to the filter outputs in order to obtain the MFCC features of each frame. The triangular filter outputs Y(i), i = 1, 2, …, M are compressed using the logarithm, and the discrete cosine transform (DCT) is applied. Here, M is the number of filters in the filter bank, i.e., 30.

    C[n] = Σ_{i=1}^{M} log( Y(i) )·cos( πn(i - 1/2) / M ),   n = 1, 2, …, 12

    where C[n] is the MFCC vector for each frame.
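    A minimal sketch of this log-compression and DCT step, taking the M = 30 filter-bank energies in and returning 12 cepstral coefficients (names are illustrative):

        /** Log-compresses filter-bank energies and applies a DCT to get cepstral coefficients. */
        public final class MelCepstrum {
            public static double[] dct(double[] filterEnergies, int numCeps) {
                int M = filterEnergies.length;
                double[] c = new double[numCeps];
                for (int n = 1; n <= numCeps; n++) {
                    double sum = 0.0;
                    for (int i = 1; i <= M; i++) {
                        sum += Math.log(filterEnergies[i - 1])
                             * Math.cos(Math.PI * n * (i - 0.5) / M);
                    }
                    c[n - 1] = sum;
                }
                return c;
            }
        }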


    The resulting vector is called the Mel-frequency cepstrum (MFC), and the individual

    components are the Mel-frequency cepstral coefficients (MFCCs). We extracted 12 features

    from each speech frame.

    3.1.9. Post Processing

    3.1.9.1 Cepstral Mean Subtraction (CMS)

    A speech signal may be subjected to some channel noise when recorded, also referred to as

    the channel effect. A problem arises if the channel effect when recording training data for a

    given person is different from the channel effect in later recordings when the person uses the

    system. The problem is that a false distance between the training data and newly recorded

    data is introduced due to the different channel effects. The channel effect is eliminated by subtracting the mean Mel-cepstrum coefficients from the Mel-cepstrum coefficients of each frame:

    ĉ_i(k) = c_i(k) - (1/N)·Σ_{j=1}^{N} c_j(k),   k = 1, 2, …, 12

    where N is the number of frames in the utterance.

    3.1.9.2 The energy feature

    The energy in a frame is the sum over time of the power of the samples in the frame; thus for a signal x in a window from time sample t1 to time sample t2, the energy is:

    E = Σ_{t=t1}^{t2} x[t]²

    3.1.9.3 Delta feature

    Another interesting fact about the speech signal is that it is not constant from frame to frame. Co-articulation (the influence of one speech sound on an adjacent or nearby speech sound) can provide a useful cue for phone identity, and it can be preserved by using delta features. Velocity (delta) and acceleration (delta-delta) coefficients are usually obtained from the static window-based information. These delta and delta-delta coefficients model the speed and acceleration of the variation of the cepstral feature vectors across adjacent windows.

    A simple way to compute deltas would be just to compute the difference between frames; thus the delta value d(t) for a particular cepstral value c(t) at time t can be estimated as:

    d(t) = ( c(t+1) - c(t-1) ) / 2


    The differencing method is simple, but since it acts as a high-pass filtering operation in the parameter domain, it tends to amplify noise. The solution to this is linear regression, i.e., fitting a first-order polynomial; the least-squares solution is easily shown to be of the following form:

    d[t] = Σ_{m=1}^{M} m·( c[t+m] - c[t-m] ) / ( 2·Σ_{m=1}^{M} m² )

    where M is the regression window size. We used M = 4.
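    The two post-processing operations described above, cepstral mean subtraction and the regression-based delta, can be sketched as follows. Edge frames are handled here by clamping the indices, which is an assumption rather than a detail stated above; names are illustrative.

        public final class PostProcessor {
            /** Cepstral mean subtraction: removes the per-utterance mean of each coefficient. */
            public static void cms(double[][] cep) {
                int numCoeffs = cep[0].length;
                for (int k = 0; k < numCoeffs; k++) {
                    double mean = 0.0;
                    for (double[] frame : cep) mean += frame[k];
                    mean /= cep.length;
                    for (double[] frame : cep) frame[k] -= mean;
                }
            }

            /** Regression-based delta of one coefficient track, window size M (M = 4 here). */
            public static double[] delta(double[] c, int M) {
                int T = c.length;
                double[] d = new double[T];
                double norm = 0.0;
                for (int m = 1; m <= M; m++) norm += 2.0 * m * m;
                for (int t = 0; t < T; t++) {
                    double sum = 0.0;
                    for (int m = 1; m <= M; m++) {
                        int plus = Math.min(T - 1, t + m);     // clamp at utterance edges
                        int minus = Math.max(0, t - m);
                        sum += m * (c[plus] - c[minus]);
                    }
                    d[t] = sum / norm;
                }
                return d;
            }
        }

    Applying delta() once gives the delta features and applying it again to the deltas gives the delta-delta features listed in the next subsection.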

    3.1.9.4 Composition of Feature Vector

    We calculated 39 Features from each frame

    12 MFCC Features

    12 Delta MFCC

    12 Delta-Delta MFCC

    1 Energy Feature

    1 Delta Energy Feature

    1 Delta-Delta Energy Feature


    3.2 GMM Implementation

    It is also important to note that because the component Gaussians act together to model the overall pdf, full covariance matrices are not necessary even if the features are not statistically independent: a linear combination of diagonal-covariance basis Gaussians is capable of modeling the correlations between feature vector elements. In addition, the use of diagonal covariance matrices greatly reduces the computational complexity. Hence, in our project, the m-th covariance matrix is

    C_m = diag( a_m1, a_m2, …, a_mK )

    where a_mj, j = 1, 2, …, K are the diagonal elements (variances) and K is the number of features in each feature vector.

    The effect of a set of M full-covariance Gaussians can be compensated by using a larger set of diagonal-covariance Gaussians (M = 16 in our case); according to the research literature, M = 16 works well for speaker modeling.

    The component pdfs can now be expressed as

    b_m(x) = ( 1 / ( (2π)^(K/2)·∏_{j=1}^{K} √a_m,j ) )·exp{ -(1/2)·Σ_{j=1}^{K} (x_j - μ_m,j)² / a_m,j }

    where μ_m,j are the elements of the m-th mean vector.
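    A sketch of evaluating the resulting mixture density for one feature vector is given below, with the mixture weights, means μ_m,j and diagonal variances a_m,j passed in as plain arrays (illustrative names; a production implementation would normally use the log-sum-exp trick instead of exponentiating each component):

        /** Log-density of one feature vector under a diagonal-covariance GMM. */
        public final class DiagonalGmm {
            public static double logPdf(double[] x, double[] weights,
                                        double[][] means, double[][] variances) {
                int K = x.length;
                double p = 0.0;
                for (int m = 0; m < weights.length; m++) {
                    double logComp = -0.5 * K * Math.log(2.0 * Math.PI);
                    for (int j = 0; j < K; j++) {
                        double diff = x[j] - means[m][j];
                        logComp += -0.5 * Math.log(variances[m][j])
                                 - 0.5 * diff * diff / variances[m][j];
                    }
                    p += weights[m] * Math.exp(logComp);   // weighted sum of component densities
                }
                return Math.log(p);
            }
        }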

    3.2.1. Block diagram of GMM based Speaker Recognition System

    Figure 3.11: Block diagram of GMM based Speaker Recognition System (enrollment: speech, feature extraction, model training, model database; verification: speech, feature extraction, matching against the claimed model, accept/reject decision)


    3.2.2. GMM Training

    Given the training speech from a speaker, the goal of speaker model training is to estimate the parameters of the GMM that best match the distribution of the training feature vectors and hence develop a robust model for the speaker. Of the several techniques available for estimating the parameters of a GMM, the most popular is Maximum Likelihood (ML) estimation, carried out with the Expectation-Maximization (EM) algorithm.

    EM is a well-established maximum-likelihood algorithm for fitting a mixture model to a set of training data. It requires an a priori selection of the model order (the number of components M to be incorporated into the model) and an initial estimate of the parameters before iterating through the training.

    The aim of the ML estimation method is to maximize the likelihood of the GMM given the training data. Under the assumption of independent feature vectors, the likelihood of the GMM λ for a sequence of T training vectors X = {x_1, x_2, …, x_T} can be written as

    P(X|λ) = ∏_{t=1}^{T} p(x_t|λ)

    In practice, the above computation is done in the log domain to avoid underflow: instead of multiplying lots of very small probabilities, we add their logarithms. Thus, the log-likelihood of a model λ for a sequence of feature vectors X = {x_1, x_2, …, x_T} is computed as follows:

    log P(X|λ) = (1/T)·Σ_{t=1}^{T} log p(x_t|λ)

    Note that in the above equation, the average log likelihood value is used so as to normalize

    out duration effects from the log-likelihood value. Also, since the (incorrect) independence assumption underestimates the actual likelihood in the presence of dependencies, scaling by T can be considered a rough compensation factor.

    Direct maximization of this likelihood function is not possible because it is a non-linear function of the parameters, so the likelihood is maximized using the Expectation-Maximization algorithm.

    The basic idea of the EM algorithm is, beginning with an initial model λ, to estimate a new model λ̄ such that P(X|λ̄) >= P(X|λ). The new model then becomes the initial


    model for the next iteration, and the process is repeated until some convergence threshold is reached, i.e., until the improvement P(X|λ̄) - P(X|λ) falls below a chosen threshold.


    For initialization, the j-th diagonal element of the covariance of the training data, C_data, can be estimated as the sample variance of the j-th feature:

    σ²_data,j = (1/T)·Σ_{t=1}^{T} ( x_t,j - μ̄_j )²

    where μ̄_j is the mean of the j-th feature over the training data. A measure of the volume that the training data occupies is derived from these variances, and the initial covariance of each mixture is calculated from that volume. A minimum covariance (threshold) value, computed in the same way, is enforced to avoid NaN (Not a Number) errors during the EM iterations. Covariance limiting was done in this way for each mixture. For simplicity, we initialized the covariance values to be the same for all Gaussian components.

    For training the GMM parameters we used the following constants:

    Number of iterations:

    MINIMUM_ITERATION = 100;
    MAXIMUM_ITERATION = 500;

    and minimum log-likelihood change for convergence:

    LOGLIKELIHOOD_CHANGE = 0.000001;
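    The iteration control implied by these constants can be sketched as follows. The EmModel interface is a placeholder for whatever object holds the mixture parameters, performs one E-step plus M-step, and returns the new average log-likelihood; it is assumed here that MINIMUM_ITERATION is the number of iterations that must run before the convergence test is applied.

        public final class GmmTrainer {
            /** Placeholder for the object that performs one EM step and reports the new
             *  average log-likelihood of the training data. */
            public interface EmModel {
                double emStep(double[][] features);
            }

            static final int MINIMUM_ITERATION = 100;
            static final int MAXIMUM_ITERATION = 500;
            static final double LOGLIKELIHOOD_CHANGE = 0.000001;

            /** Iterates EM until the average log-likelihood stops improving. */
            public static void train(EmModel model, double[][] features) {
                double previous = Double.NEGATIVE_INFINITY;
                for (int iter = 1; iter <= MAXIMUM_ITERATION; iter++) {
                    double current = model.emStep(features);
                    if (iter >= MINIMUM_ITERATION
                            && Math.abs(current - previous) < LOGLIKELIHOOD_CHANGE) {
                        break;                                // converged
                    }
                    previous = current;
                }
            }
        }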


    3.2.3. Verification

    After the training stage, we have a complete model (GMM) for each speaker. The speaker verification task is a hypothesis-testing problem: based on the input speech observations, it must be decided whether the claimed identity of the speaker is correct or not.

    So, the hypothesis test can be set as:

    H0: the speaker is the claimed speaker

    H1: the speaker is an imposter

    The decision between these two hypotheses is based on the likelihood ratio

    P(X|λ_hyp) / P(X|λ_imp)

    where P(X|λ_hyp) is the likelihood that the utterance was produced by the claimed speaker model λ_hyp, while P(X|λ_imp) is the likelihood that the utterance was produced by the imposter model λ_imp. Here the imposter model, also called the Universal Background Model (UBM), is obtained by training on a collection of speech samples from a large number of speakers, representative of the population of speakers.

    The likelihood ratio is often expressed in the logarithmic domain as

    Λ(X) = log( P(X|λ_hyp) / P(X|λ_imp) ) = log P(X|λ_hyp) - log P(X|λ_imp)

    The decision is made as follows:

    If Λ(X) < T, reject the null hypothesis, i.e., the speaker is an imposter.
    If Λ(X) >= T, accept the null hypothesis, i.e., the speaker is the claimed one.

    The threshold value T is set in such a way that the error of the system is minimal, so that true claimants are accepted and false claimants are rejected as often as possible.
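    Putting the pieces together, the verification decision reduces to a difference of average log-likelihoods compared against T. The sketch below reuses the frame-level logPdf sketched earlier in this section; the names and the exact scoring granularity are illustrative assumptions.

        public final class Verifier {
            /** Average log-likelihood of an utterance (sequence of feature vectors) under one GMM. */
            public static double avgLogLikelihood(double[][] frames, double[] w,
                                                  double[][] mu, double[][] var) {
                double sum = 0.0;
                for (double[] x : frames) sum += DiagonalGmm.logPdf(x, w, mu, var);
                return sum / frames.length;
            }

            /** Accept the identity claim if the log-likelihood ratio exceeds the threshold T. */
            public static boolean verify(double logPClaimed, double logPUbm, double threshold) {
                return (logPClaimed - logPUbm) >= threshold;
            }
        }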

    3.2.4. Performance measure of Speaker Verification System

    In general, the performance of the speaker verification system is determined by False

    Rejection Rate (FRR) and False Acceptance Rate (FAR).


    1) False Rejection Rate (FRR)

    FRR is a measure of the likelihood that the system will incorrectly reject an access attempt by an authorized user. A system's FRR is typically stated as the ratio of the number of false rejections to the number of verification tests.

    2) False Acceptance Rate (FAR)

    FAR is a measure of the likelihood that the system will incorrectly accept an access attempt by an unauthorized user. A system's FAR is usually stated as the ratio of the number of false acceptances to the number of verification tests.

    Total Error Rate (TER) is the combination of the false rejection and false acceptance rates, and the requirement of the system is to minimize it. Both errors depend on the threshold value used during verification: at lower threshold values FAR is predominant, while at higher threshold values FRR is predominant. This dependency of the two errors can be seen in the figure below. At a certain threshold value the two errors are equal and the TER is minimum.

    Figure 3.12: Equal Error Rate (EER)
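    Given log-likelihood-ratio scores for a set of genuine trials and a set of imposter trials, FRR and FAR at a particular threshold can be computed as in the sketch below (illustrative names). Sweeping the threshold and locating the point where the two rates meet gives the Equal Error Rate.

        public final class ErrorRates {
            /** Returns {FRR, FAR} for one threshold, given genuine and imposter trial scores. */
            public static double[] rates(double[] genuineScores, double[] imposterScores,
                                         double threshold) {
                int falseRejects = 0, falseAccepts = 0;
                for (double s : genuineScores) if (s < threshold) falseRejects++;   // true speaker rejected
                for (double s : imposterScores) if (s >= threshold) falseAccepts++; // imposter accepted
                return new double[] {
                    (double) falseRejects / genuineScores.length,
                    (double) falseAccepts / imposterScores.length
                };
            }
        }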


    3.3 Implementation of HMM for Speech Recognition

    The basic block diagram for isolated word recognition is given below:


    Figure 3.13: Speech Recognition algorithm flow

    In order to do isolated word speech recognition, we must perform the following:

    1) The codebook is generated using the feature vectors of the training data, and vector quantization uses the codebook to map each feature vector to a discrete observation symbol.

    2) For each word v in the vocabulary, an HMM λ_v is built, i.e., we must estimate the model parameters (A, B, π) that optimize the likelihood of the training-set observation vectors for the v-th word. In order to make reliable estimates of all model parameters, multiple observation sequences must be used. The Baum-Welch algorithm is used for estimation of the HMM parameters.

    3) For each unknown word which is to be recognized, several steps must be carried out: measurement of the observation sequence O = {O1, O2, …, OT} via feature analysis of the speech corresponding to the word, followed by calculation of the model likelihoods for all possible models, P(O|λ_v), 1 <= v <= V, followed by selection of the word whose model likelihood is highest (see the sketch after this list):

    v* = argmax_{1 <= v <= V} [ P(O|λ_v) ]

    The probability computation step is performed using the Viterbi algorithm and requires on the order of V·N²·T computations.
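    The selection in step 3 is a simple argmax over the per-word model scores, as sketched below. WordModel is a placeholder for a trained word HMM whose score is obtained with the Viterbi or forward algorithm described in 3.3.2; all names are illustrative.

        public final class WordRecognizer {
            /** Placeholder for a per-word HMM that can score a discrete observation sequence. */
            public interface WordModel {
                String word();
                double logLikelihood(int[] observations);
            }

            /** Picks the vocabulary word whose model gives the highest likelihood. */
            public static String recognize(int[] observations, java.util.List<WordModel> models) {
                String best = null;
                double bestScore = Double.NEGATIVE_INFINITY;
                for (WordModel m : models) {
                    double score = m.logLikelihood(observations);
                    if (score > bestScore) { bestScore = score; best = m.word(); }
                }
                return best;
            }
        }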


    Figure 3.14: Pronunciation model of word TOMATO

    The above figure shows the pronunciation model of the word tomato. The circles represent the states and the numbers above the arrows represent transition probabilities. The pronunciation of the same word may differ from person to person; the figure reflects two pronunciation styles for the word tomato. So, in order to best model each word, we need to train the word on as large a set of speakers as possible, so that the model covers the variation in pronunciation for that word.

    Vector Quantization:

    HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary, or short-time stationary, signal: over a short time window, speech can be approximated as a stationary process. Each acoustic feature vector represents information such as the amount of energy in different frequency bands at a particular point in time. The observation sequence for speech recognition is a sequence of acoustic feature vectors (MFCC vectors), and the phonemes are the hidden states. One way to make MFCC vectors look like symbols that we could count is to build a mapping function that maps each input vector onto one of a small number of symbols. This idea of mapping input vectors to discrete quantized symbols is called vector quantization, or VQ.

    The type of HMM that models speech signals using the VQ technique to produce the observations is called a Discrete Hidden Markov Model (DHMM). However, VQ loses some information from the speech signal even when we increase the number of codewords. This loss is due to the quantization error (distortion), which can be reduced by increasing the number of codewords in the codebook but cannot be eliminated. The long sequence of speech samples is represented by a stream of indices representing frames of different window lengths. Hence, VQ can be considered a process of redundancy removal, which minimizes the number of bits required to identify each frame of the speech


    signal. In vector quantization, we create the small symbol set by mapping each training feature vector into a small number of classes, and then we represent each class by a discrete symbol. More formally, a vector quantization system is characterized by a codebook, a clustering algorithm, and a distance metric.

    A codebook is a list of possible classes, a set of symbols constituting the features F = {f1, f2, ..., fn}. All feature vectors from the training speech data are clustered into 256 classes, thereby generating a codebook with 256 centroids, with the help of the K-means clustering technique. Vector Quantization (VQ) is then used to get the discrete observation sequence from the input feature vectors by applying the distance metric to the codebook.

    Figure 3.15: Vector Quantization

    As shown in the above figure, to make the feature vectors discrete, each incoming feature vector is compared with each of the 256 prototype vectors in the codebook. The closest one (in Euclidean distance) is selected, and the input vector is replaced by the index of the corresponding centroid in the codebook. In this way all continuous input feature vectors are quantized to a discrete set of symbols.
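    A minimal sketch of this quantization step, with the codebook held as a 256-by-39 array of centroids and squared Euclidean distance as the metric (names are illustrative):

        public final class VectorQuantizer {
            /** Maps each feature frame to the index of the nearest codebook centroid. */
            public static int[] quantize(double[][] frames, double[][] codebook) {
                int[] symbols = new int[frames.length];
                for (int t = 0; t < frames.length; t++) {
                    int best = 0;
                    double bestDist = Double.MAX_VALUE;
                    for (int c = 0; c < codebook.length; c++) {
                        double dist = 0.0;
                        for (int k = 0; k < frames[t].length; k++) {
                            double d = frames[t][k] - codebook[c][k];
                            dist += d * d;                          // squared Euclidean distance
                        }
                        if (dist < bestDist) { bestDist = dist; best = c; }
                    }
                    symbols[t] = best;                              // replace vector by centroid index
                }
                return symbols;
            }
        }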


    3.3.1. Isolated Word Recognition

    For isolated word recognition with a distinct HMM designed for each word in the vocabulary, a left-right model is more appropriate than an ergodic model, since we can then associate time with the model states in a fairly straightforward manner. Furthermore, we can envision the physical meaning of the model states as distinct sounds (e.g., phonemes, syllables) of the word being modeled.

    The issue of the number of states to use in each word model leads to two schools of thought. One idea is to let the number of states correspond roughly to the number of sounds (phonemes) within the word; hence models with 2 to 10 states would be appropriate. The other idea is to let the number of states correspond roughly to the average number of observations in a spoken version of the word, so that each state corresponds to one observation interval, i.e., about 15 ms for the analysis we use. The former approach is used in our speech recognition system. Furthermore, we restrict each word model to have the same number of states; this implies that the models will work best when they represent words with the same number of sounds.

    Figure 3.16: Average word error rate (for a digits vocabulary) versus the number of states N in the HMM

    The above figure shows a plot of average word error rate versus N for the case of recognition of

    isolated digits (i.e., a 10-word vocabulary). It can be seen that the error is somewhat

    insensitive to N, achieving a local minimum at N=6; however, differences in error rate for

    values of N close to 6 are small.

    The next issue is the choice of the observation vector and the way it is represented. Since we are representing an entire region of the vector space by a single vector, a distortion penalty is


    associated with VQ. It is advantageous to keep the distortion penalty as small as possible. However, this implies a large codebook, which leads to problems in implementing HMMs with a large number of parameters. Although the distortion steadily decreases as M increases, only small decreases in distortion accrue beyond a value of M = 32. Hence HMMs with codebook sizes from M = 32 to 256 vectors have been used in speech recognition experiments. For the discrete symbol models we have used a codebook with M = 256 codewords to generate the discrete symbols.

    Figure 3.17: Curve showing the tradeoff of VQ average distortion as a function of the size of the VQ, M (shown on a log scale)

    Another main issue is the initialization of the HMM parameters. The parameters that constitute any model are π, A, and B. The values of π are given by π = [1 0 0 0 0 ... 0], because the left-right model of HMM used in our speech recognition system always starts in the first state and ends in the last state. Random values between 0 and 1 are assigned as the initial values of the elements of the A and B parameters.

    3.3.2. Application of HMM

    Given the form of HMM, there are three basic problems of interest that must be solved for the

    model to be useful in real-world applications. These problems are the following:


    3.3.2.1 Evaluation Problem: Computing P(O|λ)

    Given the observation sequence O = O1 O2 … OT and the model λ = (A, B, π), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model?

    Solution:

    The aim of this problem is to find the probability of the observation sequence O = (O1, O2, …, OT) given the model λ, i.e. P(O|λ). Because the observations produced by the states are assumed to be independent of each other and of the time t, the probability of the observation sequence O being generated by a certain state sequence q can be calculated as a product:

    P(O|q, λ) = b_q1(O1)·b_q2(O2)·…·b_qT(OT)

    And the probability of the state sequence q is:

    P(q|λ) = π_q1·a_q1q2·a_q2q3·…·a_q(T-1)qT

    The aim was to find P(O|λ), and this probability of O (given the model λ) is obtained by summing the joint probability over all possible state sequences q, giving:

    P(O|λ) = Σ_q P(O|q, λ)·P(q|λ)

    This direct computation has one major drawback: it is infeasible due to the exponential growth of the number of computations as a function of the sequence length T. To be precise, it needs (2T - 1)·N^T multiplications and N^T - 1 additions. An excellent tool which cuts the computational requirements to linear in T is the well-known forward algorithm. The forward algorithm needs N(N+1)(T-1)+1 multiplications and N(N-1)(T-1) additions.

    Forward Algorithm

    Consider a forward probability variable α_t(i), defined for instant t and state i as:

    α_t(i) = P(O1, O2, O3, …, Ot, q_t = S_i | λ)

    This probability function can be computed for N states and T observations iteratively:

    Step 1: Initialization: α_1(i) = π_i·b_i(O1),   1 <= i <= N


    Figure 3.18: Forward Procedure - Induction Step

    Step 2: Induction

    α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i)·a_ij ]·b_j(O_{t+1}),   1 <= t <= T - 1,  1 <= j <= N

    Step 3: Termination

    P(O|λ) = Σ_{i=1}^{N} α_T(i)

    This stage is just the sum of the values of the probability function α_T(i) over all states at instant T. The sum represents the probability that the given observations were generated by the given model, i.e., how likely the given model is to produce the given observations.
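    A sketch of the forward algorithm for a discrete HMM is given below. It folds in the per-frame scaling discussed in section 3.3.3 and therefore returns log P(O|λ) rather than the raw probability; class and parameter names are illustrative.

        public final class ForwardAlgorithm {
            /** Computes log P(O|lambda) for a discrete HMM (A, B, pi) and observations obs,
             *  using per-frame scaling to avoid underflow. */
            public static double logProbability(double[][] A, double[][] B, double[] pi, int[] obs) {
                int N = A.length, T = obs.length;
                double[] alpha = new double[N];
                double logProb = 0.0;
                for (int i = 0; i < N; i++) alpha[i] = pi[i] * B[i][obs[0]];   // initialization
                logProb += Math.log(scale(alpha));
                for (int t = 1; t < T; t++) {                                   // induction
                    double[] next = new double[N];
                    for (int j = 0; j < N; j++) {
                        double sum = 0.0;
                        for (int i = 0; i < N; i++) sum += alpha[i] * A[i][j];
                        next[j] = sum * B[j][obs[t]];
                    }
                    alpha = next;
                    logProb += Math.log(scale(alpha));
                }
                return logProb;   // termination: the scaling sums accumulate log P(O|lambda)
            }

            /** Normalizes alpha in place and returns its pre-normalization sum. */
            private static double scale(double[] alpha) {
                double sum = 0.0;
                for (double a : alpha) sum += a;
                for (int i = 0; i < alpha.length; i++) alpha[i] /= sum;
                return sum;
            }
        }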

    Backward Algorithm

    This procedure is similar to the forward procedure, but it considers the state flow in the backward direction, from the last observation instant T back to the first instant 1. That means that any state is reached from the states that come just after it in time.

    To formulate this approach, let us consider the backward probability function β_t(i), which can be defined as:

    β_t(i) = P(O_{t+1}, O_{t+2}, …, O_T | q_t = S_i, λ)

    Figure 3.19: Backward Procedures - Induction Step

    In analogy to the forward procedure, we can solve for β_t(i) in the following two steps:

    1 - Initialization: β_T(i) = 1,   1 <= i <= N

    These initial values of β for all states at instant T are arbitrarily selected.

    2 - Induction:

    β_t(i) = Σ_{j=1}^{N} a_ij·b_j(O_{t+1})·β_{t+1}(j),   t = T - 1, T - 2, …, 1,  1 <= i <= N


    3.3.2.2 Decoding Problem: Finding the best path

    Given the observation sequence O = O1 O2 … OT and the model λ = (A, B, π), find the optimal state sequence q = q1 q2 … qT.

    Solution:

    The problem is to find the optimal sequence of states, given the observation sequence and the model. This means that we have to find the optimal state sequence Q = (q1, q2, q3, …, qT-1, qT) associated with the given observation sequence O = (O1, O2, O3, …, OT-1, OT) presented to the model λ = (A, B, π). The criterion of optimality here is to search for the single best state sequence through a modified dynamic-programming technique called the Viterbi algorithm.

    To explain the Viterbi algorithm, the quantity δ_t(i) is defined, which represents the maximum probability along the single best state-sequence path for the given observation sequence after t instants, ending in state i. This quantity can be defined mathematically by:

    δ_t(i) = max_{q1,q2,…,q(t-1)} P( q1, q2, …, q_{t-1}, q_t = S_i, O1, O2, …, Ot | λ )

    The best state sequence is backtracked via another function ψ_t(j). The complete algorithm can be described by the following steps:

    Step 1: Initialization: δ_1(i) = π_i·b_i(O1),  ψ_1(i) = 0,   1 <= i <= N

    Step 2: Recursion:

    δ_t(j) = max_{1 <= i <= N} [ δ_{t-1}(i)·a_ij ]·b_j(Ot),   2 <= t <= T,  1 <= j <= N

    ψ_t(j) = argmax_{1 <= i <= N} [ δ_{t-1}(i)·a_ij ],   2 <= t <= T,  1 <= j <= N

    Step 3: Termination: P* = max_{1 <= i <= N} [ δ_T(i) ],  q*_T = argmax_{1 <= i <= N} [ δ_T(i) ]

    Step 4: Path (state sequence) backtracking:

    q*_t = ψ_{t+1}( q*_{t+1} ),   t = T - 1, T - 2, …, 1


    The Viterbi algorithm can also be used to approximate P(O|λ) by using P* in its place.

    Figure 3.20: Viterbi Search
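    A sketch of the Viterbi search for a discrete HMM is given below; it works entirely in the log domain, so no scaling is needed (names are illustrative).

        public final class ViterbiAlgorithm {
            /** Returns the most likely state sequence for a discrete HMM (A, B, pi). */
            public static int[] bestPath(double[][] A, double[][] B, double[] pi, int[] obs) {
                int N = A.length, T = obs.length;
                double[][] delta = new double[T][N];
                int[][] psi = new int[T][N];
                for (int i = 0; i < N; i++) {
                    delta[0][i] = Math.log(pi[i]) + Math.log(B[i][obs[0]]);   // initialization
                }
                for (int t = 1; t < T; t++) {                                  // recursion
                    for (int j = 0; j < N; j++) {
                        double best = Double.NEGATIVE_INFINITY;
                        int arg = 0;
                        for (int i = 0; i < N; i++) {
                            double v = delta[t - 1][i] + Math.log(A[i][j]);
                            if (v > best) { best = v; arg = i; }
                        }
                        delta[t][j] = best + Math.log(B[j][obs[t]]);
                        psi[t][j] = arg;                                       // remember best predecessor
                    }
                }
                int[] path = new int[T];                                       // termination
                for (int i = 1; i < N; i++) {
                    if (delta[T - 1][i] > delta[T - 1][path[T - 1]]) path[T - 1] = i;
                }
                for (int t = T - 2; t >= 0; t--) path[t] = psi[t + 1][path[t + 1]];  // backtracking
                return path;
            }
        }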

    3.3.2.3 Training Problem: Estimating the Model Parameters

    Given the observation sequence O = O1 O2 … OT, estimate the parameters of the model λ = (A, B, π) that maximize P(O|λ).

    Solution:

    This problem deals with the training issue, which is the most difficult of the three. The task is to adjust the model parameters (A, B, π) according to a certain optimality criterion. The Baum-Welch algorithm (forward-backward algorithm) is one of the well-known techniques used to solve it. It is an iterative method for estimating new values of the model parameters. To explain the training procedure, first the a posteriori probability function γ_t(i) is defined: the probability of being in state i at instant t, given the observation sequence O and the model λ:

    γ_t(i) = P(q_t = S_i | O, λ) = P(O, q_t = S_i | λ) / P(O|λ) = α_t(i)·β_t(i) / Σ_{j=1}^{N} α_t(j)·β_t(j)


    Then another probability function ξ_t(i, j) is defined: the probability of being in state i at instant t and going to state j at instant t+1, given the model and the observation sequence O. ξ_t(i, j) can be defined mathematically as:

    ξ_t(i, j) = P( q_t = S_i, q_{t+1} = S_j | O, λ )

    Figure 3.21: Computation of ξ_t(i, j)

    From the definitions of the forward and backward variables, we can write ξ_t(i, j) in the form

    ξ_t(i, j) = α_t(i)·a_ij·b_j(O_{t+1})·β_{t+1}(j) / P(O|λ)
              = α_t(i)·a_ij·b_j(O_{t+1})·β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i)·a_ij·b_j(O_{t+1})·β_{t+1}(j)

    The relation between γ_t(i) and ξ_t(i, j) can easily be deduced from their definitions:

    γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j)

    Now, if γ_t(i) is summed over all instants (excluding instant T), we get the expected number of transitions made from state S_i, i.e., the number of times this state has been visited and left over all instants.


    On the other hand, if we sum ξ_t(i, j) over all instants (excluding T), we get the expected number of transitions that have been made from state i to state j.

    From the behavior of γ_t(i) and ξ_t(i, j), the following re-estimates of the model parameters can be deduced:

    Initial state distribution (expected number of times in state S_i at time t = 1):

    π̄_i = γ_1(i)

    Transition probabilities (expected number of transitions from state S_i to state S_j divided by the expected number of transitions from state S_i):

    ā_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

    Emission probabilities (expected number of times in state S_j observing symbol v_k divided by the expected number of times in state S_j):

    b̄_j(k) = Σ_{t: Ot = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
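    As an illustration of how γ and ξ drive the re-estimation, the sketch below re-estimates the transition matrix A from unscaled α and β values; the real implementation uses the scaled variables of section 3.3.3, and all names here are illustrative.

        public final class BaumWelch {
            /** One re-estimation of A: a_ij = sum_t xi_t(i,j) / sum_t gamma_t(i),
             *  computed from alpha[t][i], beta[t][i] and the observation sequence obs. */
            public static double[][] reestimateA(double[][] A, double[][] B, int[] obs,
                                                 double[][] alpha, double[][] beta) {
                int N = A.length, T = obs.length;
                double[][] newA = new double[N][N];
                for (int i = 0; i < N; i++) {
                    double gammaSum = 0.0;                    // expected transitions out of state i
                    double[] xiSum = new double[N];           // expected transitions i -> j
                    for (int t = 0; t < T - 1; t++) {
                        double norm = 0.0;                    // denominator: sum over all state pairs
                        for (int a = 0; a < N; a++)
                            for (int b = 0; b < N; b++)
                                norm += alpha[t][a] * A[a][b] * B[b][obs[t + 1]] * beta[t + 1][b];
                        for (int j = 0; j < N; j++) {
                            double xi = alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / norm;
                            xiSum[j] += xi;
                            gammaSum += xi;                   // gamma_t(i) = sum_j xi_t(i,j)
                        }
                    }
                    for (int j = 0; j < N; j++) newA[i][j] = xiSum[j] / gammaSum;
                }
                return newA;
            }
        }

    The emission matrix B and the initial distribution π are re-estimated in the same pass from γ, following the formulas above.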

    3.3.3. Scaling

    α_t(i) consists of the sum of a large number of terms. Since the transition matrix elements (a) and emission matrix elements (b) are less than 1, each term of α_t(i) heads exponentially toward zero as t grows, and for large t the dynamic range of the α_t(i) computation exceeds the precision range of the computer (even in double precision). The remedy is scaling: α_t(i) and β_t(i) are multiplied by a scaling factor that is independent of i (i.e., it depends only on t), with the goal of keeping the scaled α_t(i) within the dynamic range of the computer for 1 <= t <= T. At the end of the computation, the scaling coefficients cancel out exactly. When the Viterbi algorithm is used with logarithms to obtain the maximum-likelihood state sequence, no scaling is required.



    [Class diagram fragment, truncated: interface Algorithm with +execute(); operations +doPreprocessing(), +doPCMNormalization(); fields -capturedSignal, -processedSi…]