TEXT-PROMPTED REMOTE SPEAKER AUTHENTICATION - Project Report - GANESH TIWARI - IOE - TU


    TRIBHUVAN UNIVERSITY

    INSTITUTE OF ENGINEERING

    PULCHOWK CAMPUS

    TEXT-PROMPTED REMOTE SPEAKER AUTHENTICATION

    By:

    GANESH TIWARI (063/BCT/510)

    MADHAV PANDEY (063/BCT/514)

    MANOJ SHRESTHA (063/BCT/518)

    A PROJECT SUBMITTED TO THE DEPARTMENT OF ELECTRONICS

    AND COMPUTER ENGINEERING IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE BACHELOR'S DEGREE IN COMPUTER

    ENGINEERING

    DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

    LALITPUR, NEPAL

    January, 2011


    TRIBHUVAN UNIVERSITY

    INSTITUTE OF ENGINEERING

    PULCHOWK CAMPUS

    DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

    The undersigned certify that they have read, and recommended to the Institute of Engineering for

    acceptance, a project report entitled Text-Prompted Remote Speaker Authentication submitted

    by Ganesh Tiwari, Madhav Pandey and Manoj Shrestha in partial fulfillment of the requirements

    for the Bachelor's degree in Computer Engineering.

    __________________________________

    Supervisor, Dr. Subarna Shakya

    Associate Professor

    Department of Electronics and Computer Engineering

    __________________________________

    Internal Examiner,

    _________________________________

    External Examiner,

    DATE OF APPROVAL:


    COPYRIGHT

    The author has agreed that the Library, Department of Electronics and Computer Engineering,

    Pulchowk Campus, Institute of Engineering may make this report freely available for inspection.

    Moreover, the author has agreed that permission for extensive copying of this project report for

    scholarly purpose may be granted by the supervisors who supervised the project work recorded

    herein or, in their absence, by the Head of the Department wherein the project report was done. It

    is understood that due recognition will be given to the author of this report and to the Department

    of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering in any use

    of the material of this project report. Copying, publication, or any other use of this report for

    financial gain without the approval of the Department of Electronics and Computer Engineering,

    Pulchowk Campus, Institute of Engineering and the author's written permission is prohibited. Requests for permission to copy or to make any other use of the material in this report, in whole or

    in part, should be addressed to:

    Head

    Department of Electronics and Computer Engineering

    Pulchowk Campus, Institute of Engineering

    Lalitpur, Kathmandu

    Nepal


    ACKNOWLEDGEMENT

    We are very thankful to the Institute of Engineering (IOE), Pulchowk Campus for offering the

    major project course. We also thank all the teachers and staff of the Department of Electronics

    and Computer Engineering who assisted during the project period by giving suitable

    suggestions and lectures on different subject matters relating to the conduct and achievement

    of the project goals.

    We are very much obliged to Dr. Subarna Shakya, Department of Electronics and Computer

    Engineering, IOE Pulchowk Campus, for his inspiration and the valuable suggestions we received

    throughout the working period.

    We would like to thank the forum members of askmeflash.com, stackoverflow.com, and

    dsprelated.com for their quick responses and valuable opinions on our queries.

    We also express our gratitude to all the friends and juniors who helped a lot with the training data

    collection.

    Members of Project

    Ganesh Tiwari (063BCT510)

    Madhav Pandey (063BCT514)

    Manoj Shrestha (063BCT518)

    IOE, PULCHOWK CAMPUS


    ABSTRACT

    A biometric is a physical characteristic unique to each individual. It has very useful applications in

    authentication and access control.

    The designed system is a text-prompted voice biometric which incorporates a text-independent

    speaker verification system and a speaker-independent speech verification system,

    implemented independently. The foundation for this joint system is that the speech signal

    conveys both the speech content and the speaker identity. Such systems are more secure against

    playback attacks, since the word to be spoken during authentication is not set in advance.

    During the course of the project, various digital signal processing and pattern classification

    algorithms were studied. Short-time spectral analysis was performed to obtain MFCCs, energy,

    and their deltas as features. The feature extraction module is the same for both systems. Speaker

    modeling was done with GMMs, and a left-to-right discrete HMM with VQ was used for isolated

    word modeling. The results of both systems were combined to authenticate the user.

    The speech model for each word was pre-trained using utterances of 45 English words. The

    speaker model was trained on about 2 minutes of speech from each of 15 speakers. On individual

    words, the recognition rate of the speech recognition system is 92% and that of the speaker

    recognition system is 66%. For longer utterances (>5 s), the recognition rate of the

    speaker recognition system improves to 78%.


    TABLE OF CONTENTS

    PAGE OF APPROVAL

    COPYRIGHT

    ACKNOWLEDGEMENT

    ABSTRACT

    TABLE OF CONTENTS

    LIST OF FIGURES

    LIST OF SYMBOLS AND ABBREVIATIONS

    1. INTRODUCTION

    1.2 Objectives

    2. LITERATURE REVIEW

    2.1 Pattern Recognition

    2.2 Generation of Voice

    2.3 Voice as Biometric

    2.4 Speech Recognition

    2.5 Speaker Recognition

    2.5.1. Types of Speaker Recognition

    2.5.2. Modes of Speaker Recognition

    2.6 Feature Extraction for Speech/Speaker Recognition System

    2.6.1. Short Time Analysis

    2.6.2. MFCC Feature

    2.7 Speaker/Speech Modeling

    2.7.1. Gaussian Mixture Model

    2.7.2. Hidden Markov Model

    2.7.3. K-Means Clustering

    3. IMPLEMENTATION DETAILS

    3.1 Pre-Processing and Feature Extraction

    3.1.1. Capture

    3.1.2. End Point Detection and Silence Removal

    3.1.3. PCM Normalization

    3.1.4. Pre-emphasis

    3.1.5. Framing and Windowing

    3.1.6. Discrete Fourier Transform

    3.1.7. Mel Filter

    3.1.8. Cepstrum by Inverse Discrete Fourier Transform

    3.2 GMM Implementation

    3.2.1. Block Diagram of GMM Based Speaker Recognition System

    3.2.2. GMM Training

    3.2.3. Verification

    3.2.4. Performance Measure of Speaker Verification System

    3.3 Implementation of HMM for Speech Recognition

    3.3.1. Isolated Word Recognition

    3.3.2. Application of HMM

    3.3.3. Scaling

    4. UML CLASS DIAGRAMS OF THE SYSTEMS

    5. DATA COLLECTION AND TRAINING

    6. RESULTS

    7. APPLICATION AREA

    8. CONCLUSION

    REFERENCES

    APPENDIX A: BlazeDS Configuration for Remoting Service

    APPENDIX B: Words Used for HMM Training

    APPENDIX C: Development Tools and Environment

    APPENDIX D: Snapshots of Output GUI


    LIST OF FIGURES

    Figure 1.1: System Architecture

    Figure 1.2: Block Diagram of Text Prompted Speaker Verification System

    Figure 2.1: General block diagram of pattern recognition system

    Figure 2.2: Vocal Schematic

    Figure 2.3: Audio Sample for /i:/ phoneme

    Figure 2.4: Audio Magnitude Spectrum for /i:/ phoneme

    Figure 2.5: GMM with four Gaussian components and their equivalent model

    Figure 2.6: Ergodic Model of HMM

    Figure 2.7: Left to Right HMM

    Figure 3.1: Pre-Processing and Feature Extraction

    Figure 3.2: Input signal to End-point detection system

    Figure 3.3: Output signal from End point Detection System

    Figure 3.4: Signal before Pre-Emphasis

    Figure 3.5: Signal after Pre-Emphasis

    Figure 3.6: Frame Blocking of the Signal

    Figure 3.7: Hamming window

    Figure 3.8: A single frame before and after windowing

    Figure 3.9: Equally spaced Mel values

    Figure 3.10: Mel Scale Filter Bank

    Figure 3.11: Block diagram of GMM based Speaker Recognition System

    Figure 3.12: Equal Error Rate (EER)

    Figure 3.13: Speech Recognition algorithm flow

    Figure 3.14: Pronunciation model of word TOMATO

    Figure 3.15: Vector Quantization

    Figure 3.16: Average word error rate (for a digits vocabulary) versus the number of states N in the HMM

    Figure 3.17: Curve showing tradeoff of VQ average distortion as a function of the size of the VQ, M (shown on a log scale)

    Figure 3.18: Forward Procedure - Induction Step

    Figure 3.19: Backward Procedure - Induction Step

    Figure 3.20: Viterbi Search

    Figure 3.21: Computation of ξt(i, j)

    Figure 4.1: UML Diagram of Client System

    Figure 4.2: UML Diagram of Server System


    LIST OF SYMBOLS AND ABBREVIATIONS

    λ      GMM/HMM Model

    T      Threshold

    σ²     Variance

    Λ(X)   Likelihood Ratio

    μ      Mean

    π      Initial State Distribution

    A      State Transition Probability Distribution

    B      Observation Symbol Probability Distribution

    Cm     Covariance Matrix for mth Gaussian Component

    qt     State at Time t

    wm     Weighting Factor for mth Gaussian Component

    x      Feature Vector

    AIR Adobe Integrated Runtime

    DC Direct Current

    DCT Discrete Cosine Transform

    DFT Discrete Fourier Transform

    DHMM Discrete Hidden Markov Model

    DTW Dynamic Time Warping

    EM Expectation-Maximization

    FAR False Acceptance Rate

    FRR False Rejection Rate

    GMM Gaussian Mixture Model

    HMM Hidden Markov Model


    LPC Linear Prediction Coding

    MFCC Mel Frequency Cepstral Coefficient

    ML Maximum Likelihood

    PDF Probability Distribution Function

    PLP Perceptual Linear Prediction

    RIA Rich Internet Application

    RPC Remote Procedure Call

    SID Speaker IDentification

    TER Total Error Rate

    UBM Universal Background Model

    UML Unified Modeling Language

    VQ Vector Quantization

    WTP Web Tools Platform


    1. INTRODUCTION

    Biometrics is, in the simplest definition, something you are. It is a physical characteristic

    unique to each individual, such as a fingerprint, retina, iris, or speech. Biometrics has a very useful

    application in security; it can be used to authenticate a person's identity and control access to

    a restricted area, based on the premise that the set of these physical characteristics can be

    used to uniquely identify individuals.

    The speech signal conveys two important types of information: primarily the speech content

    and, on a secondary level, the speaker identity. Speech recognizers aim to extract the lexical

    information from the speech signal independently of the speaker by reducing the inter-

    speaker variability. Speaker recognition, on the other hand, is concerned with extracting the

    identity of the person speaking the utterance. So both speech recognition and speaker

    recognition are possible from the same voice input.

    Text Prompted Remote Speaker Authentication is a voice biometric system that authenticates

    a user, before permitting the user to log into a system, on the basis of the user's input voice. It

    is a web application. Voice signal acquisition and feature extraction are done on the client.

    The training and authentication tasks, based on the voice features obtained from the client side,

    are done on the server. The authentication task is based on a text-prompted version of speaker recognition,

    which incorporates both speaker recognition and speech recognition. This joint

    implementation of speech and speaker recognition includes text-independent speaker

    recognition and speaker-independent speech recognition. Speaker recognition verifies

    whether the speaker is the claimed one, while speech recognition verifies whether or not the

    spoken word matches the prompted word.

    The client side is realized in Adobe Flex whereas the server side is realized in Java. The

    communication between these two platforms is made possible with the help of BlazeDS's

    RPC RemoteObject service.


    Figure 1.1: System Architecture


    Mel Frequency Cepstral Coefficients (MFCC) are used as features for both the speech and speaker

    recognition tasks. We also combined energy features and the delta and delta-delta features of

    energy and MFCC.

    After calculating the features, a Gaussian Mixture Model (GMM) is used for speaker

    modeling, and a left-to-right Discrete Hidden Markov Model with Vector Quantization (DHMM/VQ) is used for speech modeling.

    Based on the speech model, the system decides whether or not the uttered speech matches

    what the user was prompted to utter. Similarly, based on the speaker model, the system decides

    whether or not the speaker is the claimed one. The speaker is then authenticated with the help of the

    combined result of these two tests.

    Referring to Figure 1.2, the feature extraction module is the same for both speech and speaker

    recognition, and these recognition systems are implemented independently of each other.

    Figure 1.2: Block Diagram of Text Prompted Speaker Verification System

    1.2 Objectives

    The objectives of this project are:

    To design and build a speaker verification system

    To design and build a speech verification system

    To implement these systems jointly to control remote access to a restricted area


    2. LITERATURE REVIEW

    2.1 Pattern Recognition

    Pattern recognition, one of the branches of artificial intelligence and a sub-field of machine

    learning, is the study of how machines can observe the environment, learn to distinguish patterns of interest from their background, and make sound and reasonable decisions about

    the categories of the patterns. A pattern can be a fingerprint image, a handwritten cursive

    word, a human face, a speech signal, a sales pattern, etc.

    The applications of pattern recognition include data mining, document classification,

    financial forecasting, organization and retrieval of multimedia databases, and biometrics

    (personal identification based on various physical attributes such as face, retina, speech, ear

    and fingerprints).

    The essential steps of pattern recognition are: Data Acquisition, Preprocessing, Feature

    Extraction, Training and Classification.

    Figure 2.1: General block diagram of pattern recognition system

    Features are used to describe a pattern. Features must be selected so that they are

    discriminative and invariant. They can be represented as a vector, matrix, tree, graph, or

    string. They are ideally similar for objects in the same class and very different for objects in

    different classes.

    Pattern class is a family of patterns that share some common properties. Pattern recognition

    by machine involves techniques for assigning patterns to their respective classes

    automatically and with as little human intervention as possible.

    Learning and Classification usually use one of the following approaches: Statistical Pattern

    Recognition is based on statistical characterizations of patterns, assuming that the patterns are


    generated by a probabilistic system. Syntactical (or Structural) Pattern Recognition is based

    on the structural interrelationships of features.

    Given a pattern, its recognition/classification may consist of one of the following two tasks

    according to the type of learning procedure: 1) Supervised Classification (e.g., Discriminant

    Analysis), in which the input pattern is identified as a member of a predefined class, and 2) Unsupervised Classification (e.g., clustering), in which the pattern is assigned to a previously

    unknown class.

    2.2 Generation of Voice

    Speech begins with the generation of an airstream, usually by the lungs and diaphragm, a

    process called initiation. This air then passes through the larynx, where it is modulated

    by the glottis (vocal cords). This step is called phonation or voicing, and is responsible for

    the generation of pitch and tone. Finally, the modulated air is filtered by the mouth, nose, and

    throat, a process called articulation, and the resultant pressure wave excites the air.

    Figure 2.2: Vocal Schematic

    Depending upon the positions of the various articulators, different sounds are produced.

    The position of the articulators can be modeled by a linear time-invariant system whose frequency

    response is characterized by several peaks called formants. The change in the frequencies of the

    formants characterizes the phoneme being articulated.


    As a consequence of this physiology, we can notice several characteristics of the frequency

    domain spectrum of speech. First of all, the oscillation of the glottis results in an underlying

    fundamental frequency and a series of harmonics at multiples of this fundamental.

    This is shown in the figure below, where we have plotted a brief audio waveform for the

    phoneme /i:/ and its magnitude spectrum. The fundamental frequency (180 Hz) and its harmonics appear as spikes in the spectrum. The location of the fundamental frequency is

    speaker dependent, and is a function of the dimensions and tension of the vocal cords. For

    adults it usually falls between 100 Hz and 250 Hz, and females average significantly higher

    than males.

    Figure 2.3: Audio Sample for /i:/ phoneme showing stationary property of phonemes for a short period

    The sound comes out in phonemes, which are the building blocks of speech. Each phoneme

    resonates at a fundamental frequency and its harmonics and thus has high energy at those

    frequencies; in other words, different phonemes have different formants. It is this feature that enables the

    identification of each phoneme at the recognition stage.

    The inter-speaker variations in the features of the speech signal during the utterance of a word are

    modeled in word training for speech recognition, while for speaker recognition the intra-

    speaker variations in the features over longer speech content are modeled.



    Figure 2.4: Audio Magnitude Spectrum for /i:/ phoneme showing fundamental frequency and its harmonics

    Besides the configuration of the articulators, the acoustic manifestation of a phoneme is affected by:

    Physiology and emotional state of speaker

    Phonetic context

    Accent

    2.3 Voice as Biometric

    The underlying premise for voice authentication is that each person's voice differs in pitch,

    tone, and volume enough to make it uniquely distinguishable. Several factors contribute to

    this uniqueness: size and shape of the mouth, throat, nose, and teeth (articulators) and the

    size, shape, and tension of the vocal cords. The chance that all of these are exactly the same

    in any two people is very low.

    Voice biometrics has the following advantages over other forms of biometrics:

    Natural signal to produce

    Low implementation cost, since it doesn't require a specialized input device

    Acceptable to users

    Easily combined with other forms of authentication for multifactor authentication

    The only biometric that allows users to authenticate remotely



    2.4 Speech recognition

    Speech is the dominant means for communication between humans, and promises to be

    important for communication between humans and machines, if it can just be made a little

    more reliable.

    Speech recognition is the process of converting an acoustic signal to a set of words. The

    applications include voice commands and control, data entry, voice user interfaces, automating

    the telephone operator's job in telephony, etc. The recognized words can also serve as the input to natural

    language processing.

    There are two variants of speech recognition based on the duration of the speech signal: isolated

    word recognition, in which each word is surrounded by some sort of pause, is much easier

    than recognizing continuous speech, in which words run into each other and have to be

    segmented.

    Speech recognition is a difficult task because of the many sources of variability associated

    with the signal. First, the acoustic realizations of phonemes, the smallest sound units of

    which words are composed, are highly dependent on context. Second, acoustic variability can

    result from changes in the environment as well as in the position and characteristics of the

    transducer. Third, within-speaker variability can result from changes in the speaker's physical

    and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic

    background, dialect, and vocal tract size and shape contribute to cross-speaker variability.

    Such variability is modeled in various ways. At the level of signal representation,

    representations that emphasize speaker-independent features are developed.

    2.5 Speaker Recognition

    Speaker recognition is the process of automatically recognizing who is speaking on the basis

    of individual information included in the speech waves. Speaker recognition can be classified

    into identification and verification. Speaker recognition has been applied most often as a means

    of biometric authentication.


    2.5.1. Types of Speaker Recognition

    2.5.1.1 Speaker Identification

    Speaker identification is the process of determining which registered speaker provides a

    given utterance. In a Speaker IDentification (SID) system, no identity claim is provided; the

    test utterance is scored against a set of known (registered) references for each potential

    speaker, and the one whose model best matches the test utterance is selected.

    There are two types of speaker identification tasks: closed-set and open-set speaker

    identification.

    In closed-set identification, the test utterance belongs to one of the registered speakers. During testing, a

    matching score is estimated for each registered speaker, and the speaker corresponding to the

    model with the best matching score is selected. This requires N comparisons for a population of N speakers.

    In open-set identification, any speaker can attempt to access the system; those who are not registered should be

    rejected. This requires another model, referred to as a garbage model, imposter model, or

    background model, which is trained with data provided by speakers other than the

    registered speakers. During testing, the matching score corresponding to the best speaker

    model is compared with the matching score estimated using the garbage model in order to

    accept or reject the speaker, making the total number of comparisons equal to N + 1. Speaker

    identification performance tends to decrease as the population size increases.

    2.5.1.2 Speaker Verification

    Speaker verification, on the other hand, is the process of accepting or rejecting the identity

    claim of a speaker. That is, the goal is to automatically accept or reject an identity that is

    claimed by the speaker. During testing, a verification score is estimated using the claimed

    speaker model and the anti-speaker model. This verification score is then compared to a

    threshold. If the score is higher than the threshold, the speaker is accepted; otherwise, the speaker is rejected. Thus, speaker verification involves a hypothesis test requiring a simple

    binary decision: accept or reject the claimed identity regardless of the population size. Hence,

    the performance is quite independent of the population size, but it depends on the number of

    test utterances used to evaluate the performance of the system.
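    As a hedged illustration of this decision rule (not code from the report), the accept/reject step can be written as a comparison of a log-likelihood ratio against a threshold; the model classes, the use of log-likelihoods, and the threshold value are assumptions made for the sketch.

    /** Minimal sketch of the verification decision described above; illustrative only. */
    final class VerificationDecision {

        /**
         * Accepts the claim if the log-likelihood ratio between the claimed-speaker
         * model and the background (anti-speaker) model exceeds a threshold.
         *
         * @param logLikClaimed    log p(X | claimed speaker model)
         * @param logLikBackground log p(X | background / anti-speaker model)
         * @param threshold        decision threshold (tuned, e.g., near the equal error rate)
         */
        static boolean accept(double logLikClaimed, double logLikBackground, double threshold) {
            double llr = logLikClaimed - logLikBackground;  // log-likelihood ratio score
            return llr > threshold;                         // accept iff the score exceeds the threshold
        }
    }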


    2.5.2. Modes of Speaker Recognition

    There are 3 modes in which speaker verification/identification can be done.

    2.5.2.1 Text Independent

    In text independent mode, the system relies only on the voice characteristics of the speaker;

    the lexical content of the utterance is not used. The system models the characteristics of the

    speech which show up irrespective of what is being said. This mode is used in surveillance

    or forensic applications where there is no control over the speakers accessing the system. The

    test utterances can be different from those used for enrollment; hence, text-independent

    speaker verification needs a large and rich training data set to model the characteristics of the

    speaker's voice and to cover the phonetic space.

    A larger training set and longer test segments are required than for the text-dependent mode, to appropriately model the feature variations of the current user in uttering different phonemes.

    2.5.2.2 Text Dependent

    In the text dependent mode of verification, the user is expected to say a pre-determined text -

    a voice password. Since recognition is based on the speaker characteristics as well as the

    lexical content of the password, text dependent speaker recognition systems are generally

    more robust and achieve good performance. However, such systems are not yet used on a large

    scale due to the risk of playback attacks, since the system has a priori knowledge about the

    password, i.e., the training and the test texts are the same. The speaker model encodes the

    speaker's voice characteristics associated with the phonemic or syllabic content of the

    password.

    2.5.2.3 Text-prompted

    Both text-dependent and text-independent systems are susceptible to fraud, since for typical

    applications the voice of a speaker could be captured, recorded, and reproduced. To limit this

    risk, a particular kind of text-dependent speaker verification system based on prompted text

    has been developed. The password, i.e., the text to speak, is not pre-determined; rather, the user

    is asked to speak a prompted text (digits, a word, or a phrase). If the number of distinct random

    passwords is large, the playback attack is not feasible. Hence the text prompted system is

    more secure.


    As in the case of text-independent systems, text-prompted systems also need a large and

    rich training data set for each registered speaker to create robust speaker-dependent models.

    For these reasons, we have chosen the text-prompted system.

    2.6 Feature Extraction for speech/speaker recognition system

    Signal representation, or coding of the short-term spectrum into feature vectors, is one of the

    most important steps in automatic speaker recognition and continues to be a subject of

    research. Many different techniques have been proposed in the literature, and generally they

    are based on speech production models or speech perception models.

    The goal of feature extraction is to transform the input waveform into a sequence of acoustic

    feature vectors, each vector representing the information in a small time window of the

    signal. Feature extraction transforms the high-dimensional input signal into lower dimensional

    vectors. For speaker recognition purposes, an optimal feature has the following properties:

    1. High inter-speaker variation,

    2. Low intra-speaker variation,

    3. Easy to measure,

    4. Robust against disguise and mimicry,

    5. Robust against distortion and noise,

    6. Maximally independent of the other features.

    2.6.1. Short time analysis

    The analysis of the speech signal at the spectral level is based on classic Fourier analysis.

    However, the Fourier transform cannot be directly applied to the whole signal because the

    speech signal cannot be considered stationary, due to constant changes in the

    articulatory system within each speech utterance. To solve this problem, the speech signal is

    split into a sequence of short segments in such a way that each one is short enough to be

    considered pseudo-stationary. The length of each segment, also called a window or frame,

    ranges between 10 and 40 ms (in such a short time period our articulatory system is not able

    to change significantly). Finally, a feature vector is extracted from the short-time

    spectrum in each window. The whole process is known as short-term spectral analysis.


    2.6.2. MFCC Feature

    The commonly used feature extraction methods for speech/speaker recognition are LPC (Linear

    Prediction Coding), MFCC (Mel Frequency Cepstral Coefficients), and PLP (Perceptual

    Linear Prediction). LPC is based on the assumption that a speech sample can be approximated by

    a linearly weighted summation of a determined number of preceding samples. PLP is calculated in a similar way to LPC coefficients, but transformations are first carried out

    on the spectrum of each window, aiming at incorporating human hearing behavior.

    The most popular feature extraction method, MFCC, mimics human hearing behavior by

    emphasizing lower frequencies and de-emphasizing higher frequencies.

    The Mel scale, proposed by Stevens, Volkmann and Newman in 1937, is a perceptual scale of

    pitches judged by listeners to be equal in distance from one another.

    The Mel scale is based on an empirical study of the human perceived pitch or frequency.

    Human hearing, however, is not equally sensitive at all frequency bands. It is less sensitive at

    higher frequencies, roughly above 1000 Hertz. It is a unit of pitch defined so that pairs of

    sounds which are perceptually equidistant in pitch are separated by an equal number of Mels.

    The mapping between frequency in Hertz and the Mel scale is linear below 1000 Hz and

    logarithmic above 1000 Hz:

    mel(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)

    Modeling this property of human hearing during feature extraction improves speech

    recognition performance. The form of the model used in MFCCs is to warp the frequencies

    output by the DFT onto the Mel scale. During MFCC computation, this insight is

    implemented by creating a bank of filters which collect energy from each frequency band.
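    For reference, the warping and its inverse might be written in code as follows; this is a small illustrative sketch, not taken from the project.

    /** Hz <-> mel conversions for the warping above. Illustrative sketch. */
    final class MelScale {
        static double hzToMel(double hz)  { return 2595.0 * Math.log10(1.0 + hz / 700.0); }
        static double melToHz(double mel) { return 700.0 * (Math.pow(10.0, mel / 2595.0) - 1.0); }
    }

    For example, 1000 Hz maps to approximately 1000 mel under this formula.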


    2.7 Speaker/Speech Modeling

    2.7.1. Gaussian Mixture Model

    Gaussian Mixture Models have been used extensively in speaker recognition systems due to their capability of representing a large class of sample distributions.

    Like K-Means, Gaussian Mixture Models (GMM) can be regarded as a type of unsupervised

    learning or clustering methods. GMM is based on clustering technique, where the entire set

    of experimental data set is modeled by a mixture of Gaussians. But unlike K-Means, GMMs

    are able to build soft clustering boundaries, i.e., points in space can belong to any class with a

    given probability.

    In a Gaussian mixture distribution, its density function is just a convex combination (a linear

    combination in which all coefficients or weights sum to one) of Gaussian probability density

    functions:

    Figure 2.5: GMM with four Gaussian components and their equivalent model

    Mathematically, a GMM is a weighted sum of M Gaussian component densities, given by the equation

    p(\mathbf{x} \mid \lambda) = \sum_{m=1}^{M} w_m \, g(\mathbf{x} \mid \mu_m, C_m)

    where \mathbf{x} is a k-dimensional random vector, the w_m are the mixture weights that show the relative importance of each component and satisfy the constraint \sum_{m=1}^{M} w_m = 1, and g(\mathbf{x} \mid \mu_m, C_m), m = 1, 2, \ldots, M, are the component densities, each of which is a k-dimensional Gaussian function (pdf) of the form

    g(\mathbf{x} \mid \mu_m, C_m) = \frac{1}{(2\pi)^{k/2} \, |C_m|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \mu_m)^{T} C_m^{-1} (\mathbf{x} - \mu_m) \right\}

    where \mu_m is the mean vector of length k of the mth Gaussian pdf and C_m is the k \times k covariance matrix of the mth Gaussian pdf.

    Thus the complete Gaussian Mixture Model is parameterized by the mixture weights, mean vectors, and covariance matrices of all component densities. These parameters are collectively represented by the notation

    \lambda = \{ w_m, \mu_m, C_m \}, \quad m = 1, 2, \ldots, M

    These parameters are estimated during training. For the speaker recognition system, each

    speaker is represented by a GMM and is referred to by his/her model \lambda.
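    To make the model concrete, the sketch below evaluates the GMM density and an average log-likelihood score for a sequence of feature vectors. Diagonal covariance matrices are assumed purely for illustration; the report does not specify this code.

    /** Illustrative GMM density evaluation with diagonal covariances; not the report's code. */
    final class DiagonalGmm {
        private final double[] weights;   // w_m, summing to 1
        private final double[][] means;   // mu_m, one row per component
        private final double[][] vars;    // diagonal of C_m, one row per component

        DiagonalGmm(double[] weights, double[][] means, double[][] vars) {
            this.weights = weights;
            this.means = means;
            this.vars = vars;
        }

        /** p(x | lambda) = sum_m w_m * g(x | mu_m, C_m) with diagonal C_m. */
        double density(double[] x) {
            double p = 0.0;
            for (int m = 0; m < weights.length; m++) {
                double logG = -0.5 * x.length * Math.log(2.0 * Math.PI);
                for (int d = 0; d < x.length; d++) {
                    double diff = x[d] - means[m][d];
                    logG += -0.5 * Math.log(vars[m][d]) - 0.5 * diff * diff / vars[m][d];
                }
                p += weights[m] * Math.exp(logG);
            }
            return p;
        }

        /** Average log-likelihood of a sequence of feature vectors, usable as a matching score. */
        double averageLogLikelihood(double[][] frames) {
            double sum = 0.0;
            for (double[] frame : frames) {
                sum += Math.log(density(frame));
            }
            return sum / frames.length;
        }
    }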

    GMM is widely used in speaker modeling and classification due to its two important benefits:

    First, the individual Gaussian components in a speaker-dependent GMM are interpreted to

    represent some broad acoustic classes such as speaker-dependent vocal tract configurations

    that are useful for modeling speaker identity. A speaker voice can be characterized by a set of

    acoustic classes representing some broad phonetic events such as vowels, nasals, fricatives.

    These acoustic classes reflect some general speaker-dependent vocal tract configurations that

    are useful for characterizing speaker identity. The spectral shape of the i th acoustic class can

    in turn be represented by mean of the ith component density and variations of the average

    spectral shape can be represented by the covariance matrix. These acoustic classes are hidden

    before training. Second, a Gaussian mixture density provides a smooth approximation to the

    long-term sample distribution of training utterances by a given speaker. The unimodal

    Gaussian speaker model represents a speaker's feature distribution by a mean vector and

    covariance matrix and the VQ model represents a speakers distribution by a discrete set of

    characteristic templates. GMM acts as a hybrid between these two models using a discrete set

    of Gaussian functions, each with their own mean and covariance matrix to allow better

    modeling capability.


    2.7.2. Hidden Markov Model

    In general, a Markov model is a way of describing a process that goes through a series of

    states. The model describes all the possible paths through the state space and assigns a

    probability to each one. The probability of transitioning from the current state to another one

    depends only on the current state, not on any prior part of the path.

    HMMs can be applied in many fields where the goal is to recover a data sequence that is not

    immediately observable. Common applications include: Cryptanalysis, Speech recognition,

    Part-of-speech tagging, Machine translation, Partial discharge, Gene prediction, Alignment of

    bio-sequences, Activity recognition.

    2.7.2.1 Discrete Markov Processes

    The transition probability a_{ij} of a first-order Markov chain with N distinct states S_1, S_2, \ldots, S_N is given by:

    a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i], \quad 1 \le i, j \le N

    where q_t is the state at time t.

    The state transition coefficients have the following properties (due to standard stochastic constraints):

    a_{ij} \ge 0, \qquad \sum_{j=1}^{N} a_{ij} = 1

    The transition probabilities for all states in a model can be described by an N \times N transition probability matrix A = \{a_{ij}\}.

    The initial state distribution vector is given by:

    \pi = \big[\, P(q_1 = S_1), \; P(q_1 = S_2), \; \ldots, \; P(q_1 = S_N) \,\big]

    The stochastic property for the initial state distribution vector is:

    \sum_{i=1}^{N} \pi_i = 1, \qquad \text{where } \pi_i = P(q_1 = S_i), \; 1 \le i \le N

    The Markov model can then be described by

    \lambda = (A, \pi)

    This stochastic process could be called an observable Markov model, since the output of the

    process is the set of states at each instant of time, where each state corresponds to a physical

    (observable) event.

    2.7.2.3 Hidden Markov Model

    The Markov model is too restrictive to be applicable to many problems of interest. So the concept

    of the Markov model is extended to the Hidden Markov model to include the case where the

    observation is a probabilistic function of the state. The resulting model is a doubly embedded

    stochastic process with an underlying stochastic process that is not observable (i.e., hidden),

    but can only be observed through another set of stochastic processes that produce the

    sequence of observations. The difference is that in a Markov chain the output state is

    completely determined at each time t. In the Hidden Markov Model the state at each time t

    must be inferred from observations. An observation is a probabilistic function of a state.

    Elements of HMM

    The HMM is characterized by the following:

    1) Set of hidden states

    S = \{S_1, S_2, \ldots, S_N\}, with the state at time t denoted q_t \in S

    2) Set of observation symbols per state

    V = \{v_1, v_2, \ldots, v_M\}, with the observation at time t denoted O_t \in V

    3) The initial state distribution

    \pi = \{\pi_i\}, \quad \pi_i = P[q_1 = S_i], \quad 1 \le i \le N

    4) State transition probability distribution

    A = \{a_{ij}\}, \quad a_{ij} = P[q_{t+1} = S_j \mid q_t = S_i], \quad 1 \le i, j \le N

    5) Observation symbol probability distribution in state j

    B = \{b_j(k)\}, \quad b_j(k) = P[v_k \text{ at } t \mid q_t = S_j], \quad 1 \le j \le N, \; 1 \le k \le M

    Normally, an HMM is written compactly as \lambda = (A, B, \pi).
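    A minimal sketch of how the parameter set λ = (A, B, π) might be held in code; this is an illustrative assumption, not the project's implementation.

    /** Illustrative container for discrete-HMM parameters lambda = (A, B, pi); not the report's code. */
    final class DiscreteHmm {
        final double[] pi;     // initial state distribution, pi[i] = P(q1 = S_i)
        final double[][] a;    // transition probabilities, a[i][j] = P(q_{t+1} = S_j | q_t = S_i)
        final double[][] b;    // emission probabilities, b[j][k] = P(O_t = v_k | q_t = S_j)

        DiscreteHmm(double[] pi, double[][] a, double[][] b) {
            this.pi = pi;
            this.a = a;
            this.b = b;
        }

        int numStates()  { return pi.length; }
        int numSymbols() { return b[0].length; }
    }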

    2.7.2.4 Types of HMMs

    An ergodic, or fully connected, HMM has the property that every state can be reached from

    every other state in a finite number of steps. This type of model has the property that every

    a_{ij} coefficient is positive. For some applications, other types of HMMs have been found to account for observed properties of the signal being modeled better than the standard ergodic

    model.

    Figure 2.6: Ergodic Model of HMM

    One such model is the left-right model, or Bakis model, because the underlying state sequence

    associated with the model has the property that as time increases the state index increases (or

    stays the same), i.e., the states proceed from left to right. Clearly, the left-right type of HMM

    has the desirable property that it can readily model signals whose properties change over time,

    e.g., speech.


    Figure 2.7: Left to Right HMM

    The properties of left-right HMMs are:

    1) The state transition coefficients have the property

    a_{ij} = 0, \quad j < i

    i.e., no transition is allowed to states whose indices are lower than that of the current state.

    2) The initial state probabilities have the property

    \pi_1 = 1, \qquad \pi_i = 0 \;\; \text{for} \;\; i \ne 1

    since the state sequence must begin in state 1 (and end in state N).

    3) The state transition coefficients for the last state in a left-right model are specified as

    a_{NN} = 1, \qquad a_{Nj} = 0 \;\; \text{for} \;\; j < N

    With left-right models, additional constraints are placed on the state transition coefficients to

    make sure that large changes in state indices do not occur; hence a constraint of the form

    a_{ij} = 0, \quad j > i + \Delta

    is often used. The value of \Delta is 2 in this speech recognition system, i.e., no jumps of more

    than 2 states are allowed. The form of the state transition matrix for \Delta = 2 and N = 4 is shown below.
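    The matrix below is reconstructed from the stated constraints (a_{ij} = 0 for j < i and for j > i + 2); it stands in for the matrix shown in the original report.

    A =
    \begin{bmatrix}
    a_{11} & a_{12} & a_{13} & 0      \\
    0      & a_{22} & a_{23} & a_{24} \\
    0      & 0      & a_{33} & a_{34} \\
    0      & 0      & 0      & a_{44}
    \end{bmatrix}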


    2.7.3. K-Means Clustering

    Clustering can be considered the most important unsupervised learning problem; like every

    other problem of this kind, it deals with finding a structure in a collection of unlabeled data.

    A loose definition of clustering could be the process of organizing objects into groups whose

    members are similar in some way. A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.

    In statistics and machine learning, k-means clustering is a method of cluster analysis which

    aims to partition n observations into k clusters in which each observation belongs to the

    cluster with the nearest mean.

    The algorithm is composed of the following steps:

    1. Place K points into the space represented by the objects that are being clustered.

    These points represent initial group centroids.

    2. Assign each object to the group that has the closest centroid.

    3. When all objects have been assigned, recalculate the positions of the K centroids.

    4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation

    of the objects into groups from which the metric to be minimized can be calculated.

    Both the clustering process and the decoding process require a distance metric or distortion

    metric that specifies how similar two acoustic feature vectors are. The distance metric is used

    to build clusters, to find a prototype vector for each cluster, and to compare incoming vectors

    to the prototypes. The simplest distance metric for acoustic feature vectors is Euclidean

    distance, the distance in N-dimensional space between the two points

    defined by the two vectors.
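    A compact sketch of the k-means steps above with the Euclidean metric; it is illustrative only, and the array layout, random initialization, and convergence test are assumptions.

    import java.util.Random;

    /** Illustrative k-means clustering of acoustic feature vectors; a sketch, not the report's code. */
    final class KMeansSketch {

        /** Returns k centroids found by the assign/recompute iteration described above. */
        static double[][] cluster(double[][] data, int k, int maxIter, long seed) {
            Random rnd = new Random(seed);
            double[][] centroids = new double[k][];
            for (int c = 0; c < k; c++) {
                centroids[c] = data[rnd.nextInt(data.length)].clone();   // step 1: initial centroids
            }
            int[] assignment = new int[data.length];
            for (int iter = 0; iter < maxIter; iter++) {
                boolean changed = false;
                for (int n = 0; n < data.length; n++) {                  // step 2: assign to nearest centroid
                    int best = 0;
                    double bestDist = Double.MAX_VALUE;
                    for (int c = 0; c < k; c++) {
                        double d = squaredEuclidean(data[n], centroids[c]);
                        if (d < bestDist) { bestDist = d; best = c; }
                    }
                    if (assignment[n] != best) { assignment[n] = best; changed = true; }
                }
                for (int c = 0; c < k; c++) {                            // step 3: recompute centroids
                    double[] sum = new double[data[0].length];
                    int count = 0;
                    for (int n = 0; n < data.length; n++) {
                        if (assignment[n] == c) {
                            for (int d = 0; d < sum.length; d++) sum[d] += data[n][d];
                            count++;
                        }
                    }
                    if (count > 0) {
                        for (int d = 0; d < sum.length; d++) sum[d] /= count;
                        centroids[c] = sum;
                    }
                }
                if (!changed) break;                                     // step 4: stop when assignments settle
            }
            return centroids;
        }

        private static double squaredEuclidean(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) { double diff = a[i] - b[i]; s += diff * diff; }
            return s;
        }
    }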


    3. IMPLEMENTATION DETAILS

    The implementation of the joint speaker/speech recognition system includes a common pre-

    processing and feature extraction module, text-independent speaker modeling and

    classification by GMM, and speaker-independent speech modeling and classification by

    HMM/VQ.

    3.1 Pre-Processing and Feature Extraction

    Starting from the capturing of the audio signal, feature extraction consists of the following steps

    as shown in the block diagram below:

    (Figure 3.1 shows the pipeline: speech signal → silence removal → pre-emphasis → framing → windowing → DFT → Mel filter bank → log → IDFT → CMS, producing 12 MFCC coefficients and 1 energy feature per frame, plus their delta features.)

    Figure 3.1: Pre-Processing and Feature Extraction

    3.1.1. Capture

    The first step in processing speech is to convert the analog representation (first air pressure,

    and then analog electric signals in a microphone) into a digital signal x[n], where n is an

    index over time. Analysis of the audio spectrum shows that nearly all energy resides in the

    band between DC and 4 kHz, and beyond 10 kHz there is virtually no energy whatsoever.

    The sound format used: 22050 Hz sampling rate, 16-bit signed samples, little-endian byte order, mono channel, uncompressed PCM.


    3.1.2. End point detection and Silence removal

    The captured audio signal may contain silence at different positions, such as the beginning of the

    signal, in between the words of a sentence, and the end of the signal. If silent frames are included,

    modeling resources are spent on parts of the signal which do not contribute to the

    identification. The silence present must therefore be removed before further processing.

    There are several ways of doing this; the most popular are Short-Time Energy and Zero

    Crossing Rate, but they have their own limitations because thresholds must be set on an ad hoc

    basis. The algorithm we used [Ref. 4] uses statistical properties of the background noise as well as

    physiological aspects of speech production, and does not assume any ad hoc threshold. It

    assumes that the background noise present in the utterances is Gaussian in nature.

    Usually the first 200 ms or more (we used 4410 samples at the sampling rate of 22050

    samples/sec) of a speech recording corresponds to silence (or background noise), because the

    speaker takes some time to start reading when the recording starts.

    Endpoint Detection Algorithm

    Step 1: Calculate the mean (μ) and standard deviation (σ) of the first 200 ms of samples of the

    given utterance. The background noise is characterized by this μ and σ.

    Step 2: Go from the first sample to the last sample of the speech recording. For each sample, check

    whether the one-dimensional Mahalanobis distance, i.e., |x − μ|/σ, is greater than 3 or not. If the

    Mahalanobis distance is greater than 3, the sample is treated as a voiced sample;

    otherwise it is unvoiced/silence.

    This threshold rejects up to 99.7% of the background-noise samples, as given by P[|x − μ| ≤ 3σ] = 0.997 for a

    Gaussian distribution, thus accepting only the voiced samples.

    Step 3: Mark each voiced sample as 1 and each unvoiced sample as 0. Divide the whole speech

    signal into 10 ms non-overlapping windows. Represent the complete speech by only zeros

    and ones.

    Step 4: Suppose there are M zeros and N ones in a window. If M ≥ N,

    then convert each of the ones to zeros, and vice versa. This method is adopted keeping in mind

    that the speech production system, consisting of the vocal cords, tongue, vocal tract, etc., cannot

    change abruptly within a short time window, taken here as 10 ms.

    Step 5: Collect the voiced part only, according to the samples labeled 1, from the windowed

    array and dump it into a new array. Retrieve the voiced part of the original speech signal from the

    samples labeled 1.
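    A condensed sketch of this endpoint detection algorithm; it is illustrative, and the handling of the first 200 ms, the 10 ms window smoothing, and the array conventions are assumptions rather than the project's exact code.

    /** Illustrative endpoint detection following the steps above; not the report's exact code. */
    final class EndpointDetector {

        /** Returns only the voiced samples of the input signal. */
        static double[] removeSilence(double[] x, int sampleRate) {
            int noiseLen = Math.min(sampleRate / 5, x.length);       // step 1: first ~200 ms as background noise
            double mean = 0.0;
            for (int i = 0; i < noiseLen; i++) mean += x[i];
            mean /= noiseLen;
            double var = 0.0;
            for (int i = 0; i < noiseLen; i++) var += (x[i] - mean) * (x[i] - mean);
            double std = Math.sqrt(var / noiseLen);

            boolean[] voiced = new boolean[x.length];
            for (int i = 0; i < x.length; i++) {                      // step 2: Mahalanobis distance > 3
                voiced[i] = Math.abs(x[i] - mean) / std > 3.0;
            }

            int win = sampleRate / 100;                               // steps 3-4: 10 ms majority smoothing
            for (int start = 0; start < x.length; start += win) {
                int end = Math.min(start + win, x.length);
                int ones = 0;
                for (int i = start; i < end; i++) if (voiced[i]) ones++;
                boolean label = ones > (end - start) - ones;          // majority vote within the window
                for (int i = start; i < end; i++) voiced[i] = label;
            }

            int count = 0;                                            // step 5: keep voiced samples only
            for (boolean v : voiced) if (v) count++;
            double[] out = new double[count];
            int j = 0;
            for (int i = 0; i < x.length; i++) if (voiced[i]) out[j++] = x[i];
            return out;
        }
    }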

    Figure 3.2: Input signal to End-point detection system

    Figure 3.3: Output signal from End point Detection System

    3.1.3. PCM Normalization

    The extracted pulse-code-modulated amplitude values are normalized to avoid amplitude

    variation during capturing.

    3.1.4. Pre-emphasis

    Usually the speech signal is pre-emphasized before any further processing. If we look at the

    spectrum of voiced segments like vowels, there is more energy at the lower frequencies than at the

    higher frequencies. This drop in energy across frequencies is caused by the nature of the

    glottal pulse. Boosting the high frequency energy makes information from these higher

    formants more available to the acoustic model and improves phone detection accuracy.

    The pre-emphasis filter is a first-order high-pass filter. In the time domain, with input x[n]

    and 0.9 \le \alpha \le 1.0, the filter equation is:

    y[n] = x[n] - \alpha \, x[n-1]

    We used \alpha = 0.95.
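    The filter can be written in a few lines of code (illustrative sketch):

    /** Applies y[n] = x[n] - alpha * x[n-1]; alpha = 0.95 as used above. Illustrative sketch. */
    final class PreEmphasis {
        static double[] apply(double[] x, double alpha) {
            double[] y = new double[x.length];
            y[0] = x[0];                             // no previous sample for n = 0
            for (int n = 1; n < x.length; n++) {
                y[n] = x[n] - alpha * x[n - 1];
            }
            return y;
        }
    }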



    Figure 3.4: Signal before Pre-Emphasis

    Figure 3.5: Signal after Pre-Emphasis

    3.1.5. Framing and windowing

    Speech is a non-stationary signal, meaning that its statistical properties are not constant

    across time. Instead, we want to extract spectral features from a small window of speech that

    characterizes a particular subphone and for which we can make the (rough) assumption that

    the signal is stationary (i.e. its statistical properties are constant within this region).

    We used frame blocks of 23.22 ms with 50% overlap, i.e., 512 samples per frame.

    Figure 3.6: Frame Blocking of the Signal



    The rectangular window (i.e., no window) can cause problems when we do Fourier analysis;

    it abruptly cuts off the signal at its boundaries. A good window function has a narrow main

    lobe and low side-lobe levels in its transfer function, which shrinks the values of the signal

    toward zero at the window boundaries, avoiding discontinuities. The most commonly used

    window function in speech processing is the Hamming window, defined as follows:

    w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1

    Figure 3.7: Hamming window

    The extraction of the windowed signal takes place by multiplying the value of the signal at time n,

    s_{frame}[n], with the value of the window at time n, s_w[n]:

    y[n] = s_w[n] \cdot s_{frame}[n]

    Figure 3.8: A single frame before and after windowing
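    A rough sketch of framing and windowing with the parameters stated above (512-sample frames, 50% overlap, Hamming window); this is illustrative, not the report's code.

    /** Splits a signal into 50%-overlapping frames and applies a Hamming window. Illustrative sketch. */
    final class Framing {
        static double[][] frameAndWindow(double[] x, int frameSize) {
            int hop = frameSize / 2;                                     // 50% overlap
            int numFrames = Math.max(0, 1 + (x.length - frameSize) / hop);
            double[][] frames = new double[numFrames][frameSize];
            for (int f = 0; f < numFrames; f++) {
                int offset = f * hop;
                for (int n = 0; n < frameSize; n++) {
                    double w = 0.54 - 0.46 * Math.cos(2.0 * Math.PI * n / (frameSize - 1)); // Hamming window
                    frames[f][n] = w * x[offset + n];
                }
            }
            return frames;
        }
    }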



    3.1.6. Discrete Fourier Transform

    A Discrete Fourier Transform (DFT) of the windowed signal is used to extract the frequency

    content (the spectrum) of the current frame. The tool for extracting spectral information i.e.,

    how much energy the signal contains at discrete frequency bands for a discrete-time

    (sampled) signal is the Discrete Fourier Transform, or DFT. The input to the DFT is a windowed signal x[n]...x[m], and the output, for each of N discrete frequency bands, is a complex number X[k] representing the magnitude and phase of that frequency component in the original signal:

    X[k] = Σ_{n=0}^{N-1} x[n]·e^(-j2πkn/N),   k = 0, 1, 2, …, N - 1

    The commonly used algorithm for computing the DFT is the Fast Fourier Transform, or FFT for short.
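    In practice an FFT routine is used; the naive O(N²) sketch below only illustrates the definition of the DFT and the magnitude spectrum that the later stages consume (names are illustrative, not from the project code):

        /** Naive DFT of one windowed frame; returns the magnitude spectrum |X[k]|. */
        public final class Dft {
            public static double[] magnitude(double[] frame) {
                int N = frame.length;
                double[] mag = new double[N];
                for (int k = 0; k < N; k++) {
                    double re = 0.0, im = 0.0;
                    for (int n = 0; n < N; n++) {
                        double angle = -2.0 * Math.PI * k * n / N;
                        re += frame[n] * Math.cos(angle);      // real part of X[k]
                        im += frame[n] * Math.sin(angle);      // imaginary part of X[k]
                    }
                    mag[k] = Math.sqrt(re * re + im * im);
                }
                return mag;
            }
        }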

    3.1.7. Mel Filter

    For calculating the MFCC, first, a transformation is applied according to the following formula:

    mel(x) = 2595·log10( 1 + x / 700 )

    where x is the linear frequency.

    Then, a filter bank is applied to the amplitude of the mel-scaled spectrum.

    The Mel frequency warping is most conveniently done by utilizing a filter bank with filters

    centered according to Mel frequencies. The width of the triangular filters varies according to

    the Mel scale, so that the log total energy in a critical band around the center frequency is

    included. The centers of the filters are uniformly spaced in the mel scale.


    Figure 3.9: Equally spaced Mel values

    The output of the Mel filter bank describes the distribution of energy in each Mel-scale band; one output value is obtained per filter, giving a vector of filter-bank energies for each frame.

    Figure 3.10: Triangular filter bank in frequency scale

    We have used 30 filters in the filter bank.


    The boundary frequencies of the triangular filters are uniformly spaced on the mel scale and can be computed from the raw acoustic frequency as follows:

    f(i) = mel⁻¹( mel(f_low) + i·( mel(f_high) - mel(f_low) ) / (M + 1) ),   i = 1, 2, …, M

    where

    f_low = 0 and f_high = Fs/2 (half the sampling frequency),

    mel(f) = 2595·log10( 1 + f / 700 )

    mel⁻¹(m) = 700·( 10^(m/2595) - 1 )
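    A sketch of how the filter positions can be derived is given below. It assumes f_low = 0 and f_high = Fs/2, and it uses a common convention for mapping the boundary frequencies to DFT bin indices; the exact mapping in the project code may differ, and the names are illustrative.

        /** Mel-scale helpers and filter-bank boundary points spaced uniformly in mel. */
        public final class MelFilterBank {
            static double hzToMel(double f) { return 2595.0 * Math.log10(1.0 + f / 700.0); }
            static double melToHz(double m) { return 700.0 * (Math.pow(10.0, m / 2595.0) - 1.0); }

            /** DFT bin indices of the numFilters + 2 boundary points of the triangular filters. */
            public static int[] boundaryBins(int numFilters, int fftSize, double sampleRate) {
                double melLow = hzToMel(0.0);
                double melHigh = hzToMel(sampleRate / 2.0);
                int[] bins = new int[numFilters + 2];
                for (int i = 0; i < bins.length; i++) {
                    double mel = melLow + i * (melHigh - melLow) / (numFilters + 1);
                    bins[i] = (int) Math.floor((fftSize + 1) * melToHz(mel) / sampleRate);
                }
                return bins;
            }
        }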

    3.1.8. Cepstrum by Inverse Discrete Fourier Transform

    A cepstral transform is applied to the filter outputs in order to obtain the MFCC features of each frame. The triangular filter outputs Y(i), i = 1, 2, …, M are compressed using the logarithm, and the discrete cosine transform (DCT) is applied. Here, M is the number of filters in the filter bank, i.e., 30.

    C[n] = Σ_{i=1}^{M} log( Y(i) )·cos( πn(i - 1/2) / M ),   n = 1, 2, …, 12

    where C[n] is the MFCC vector for each frame.
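    A minimal sketch of this log-compression and DCT step, taking the M = 30 filter-bank energies in and returning 12 cepstral coefficients (names are illustrative):

        /** Log-compresses filter-bank energies and applies a DCT to get cepstral coefficients. */
        public final class MelCepstrum {
            public static double[] dct(double[] filterEnergies, int numCeps) {
                int M = filterEnergies.length;
                double[] c = new double[numCeps];
                for (int n = 1; n <= numCeps; n++) {
                    double sum = 0.0;
                    for (int i = 1; i <= M; i++) {
                        sum += Math.log(filterEnergies[i - 1])
                             * Math.cos(Math.PI * n * (i - 0.5) / M);
                    }
                    c[n - 1] = sum;
                }
                return c;
            }
        }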


    The resulting vector is called the Mel-frequency cepstrum (MFC), and the individual

    components are the Mel-frequency cepstral coefficients (MFCCs). We extracted 12 features

    from each speech frame.

    3.1.9. Post Processing

    3.1.9.1 Cepstral Mean Subtraction (CMS)

    A speech signal may be subjected to some channel noise when recorded, also referred to as

    the channel effect. A problem arises if the channel effect when recording training data for a

    given person is different from the channel effect in later recordings when the person uses the

    system. The problem is that a false distance between the training data and newly recorded

    data is introduced due to the different channel effects. The channel effect is eliminated by subtracting the mean Mel-cepstrum coefficients from the Mel-cepstrum coefficients of each frame:

    ĉ_i(k) = c_i(k) - (1/N)·Σ_{j=1}^{N} c_j(k),   k = 1, 2, …, 12

    where N is the number of frames in the utterance.

    3.1.9.2 The energy feature

    The energy in a frame is the sum over time of the power of the samples in the frame; thus for a signal x in a window from time sample t1 to time sample t2, the energy is:

    E = Σ_{t=t1}^{t2} x[t]²

    3.1.9.3 Delta feature

    Another interesting fact about the speech signal is that it is not constant from frame to frame. Co-articulation (the influence of one speech sound on an adjacent or nearby speech sound) can provide a useful cue for phone identity, and it can be preserved by using delta features. Velocity (delta) and acceleration (delta-delta) coefficients are usually obtained from the static window-based information. These delta and delta-delta coefficients model the speed and acceleration of the variation of the cepstral feature vectors across adjacent windows.

    A simple way to compute deltas would be just to compute the difference between frames; thus the delta value d(t) for a particular cepstral value c(t) at time t can be estimated as:

    d(t) = ( c(t+1) - c(t-1) ) / 2


    The differencing method is simple, but since it acts as a high-pass filtering operation in the parameter domain, it tends to amplify noise. The solution to this is linear regression, i.e., fitting a first-order polynomial; the least-squares solution is easily shown to be of the following form:

    d[t] = Σ_{m=1}^{M} m·( c[t+m] - c[t-m] ) / ( 2·Σ_{m=1}^{M} m² )

    where M is the regression window size. We used M = 4.
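    The two post-processing operations described above, cepstral mean subtraction and the regression-based delta, can be sketched as follows. Edge frames are handled here by clamping the indices, which is an assumption rather than a detail stated above; names are illustrative.

        public final class PostProcessor {
            /** Cepstral mean subtraction: removes the per-utterance mean of each coefficient. */
            public static void cms(double[][] cep) {
                int numCoeffs = cep[0].length;
                for (int k = 0; k < numCoeffs; k++) {
                    double mean = 0.0;
                    for (double[] frame : cep) mean += frame[k];
                    mean /= cep.length;
                    for (double[] frame : cep) frame[k] -= mean;
                }
            }

            /** Regression-based delta of one coefficient track, window size M (M = 4 here). */
            public static double[] delta(double[] c, int M) {
                int T = c.length;
                double[] d = new double[T];
                double norm = 0.0;
                for (int m = 1; m <= M; m++) norm += 2.0 * m * m;
                for (int t = 0; t < T; t++) {
                    double sum = 0.0;
                    for (int m = 1; m <= M; m++) {
                        int plus = Math.min(T - 1, t + m);     // clamp at utterance edges
                        int minus = Math.max(0, t - m);
                        sum += m * (c[plus] - c[minus]);
                    }
                    d[t] = sum / norm;
                }
                return d;
            }
        }

    Applying delta() once gives the delta features and applying it again to the deltas gives the delta-delta features listed in the next subsection.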

    3.1.9.4 Composition of Feature Vector

    We calculated 39 Features from each frame

    12 MFCC Features

    12 Delta MFCC

    12 Delta-Delta MFCC

    1 Energy Feature

    1 Delta Energy Feature

    1 Delta-Delta Energy Feature


    3.2 GMM Implementation

    It is also important to note that because the component Gaussians act together to model the overall pdf, full covariance matrices are not necessary even if the features are not statistically independent: a linear combination of diagonal-covariance basis Gaussians is capable of modeling the correlations between feature vector elements. In addition, the use of diagonal covariance matrices greatly reduces the computational complexity. Hence, in our project, the m-th covariance matrix is

    C_m = diag( a_m1, a_m2, …, a_mK )

    where a_mj, j = 1, 2, …, K are the diagonal elements (variances) and K is the number of features in each feature vector.

    The effect of a set of M full-covariance Gaussians can be compensated by using a larger set of diagonal-covariance Gaussians (M = 16 in our case); according to the research literature, M = 16 works well for speaker modeling.

    The component pdfs can now be expressed as

    b_m(x) = ( 1 / ( (2π)^(K/2)·∏_{j=1}^{K} √a_m,j ) )·exp{ -(1/2)·Σ_{j=1}^{K} (x_j - μ_m,j)² / a_m,j }

    where μ_m,j are the elements of the m-th mean vector.
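    A sketch of evaluating the resulting mixture density for one feature vector is given below, with the mixture weights, means μ_m,j and diagonal variances a_m,j passed in as plain arrays (illustrative names; a production implementation would normally use the log-sum-exp trick instead of exponentiating each component):

        /** Log-density of one feature vector under a diagonal-covariance GMM. */
        public final class DiagonalGmm {
            public static double logPdf(double[] x, double[] weights,
                                        double[][] means, double[][] variances) {
                int K = x.length;
                double p = 0.0;
                for (int m = 0; m < weights.length; m++) {
                    double logComp = -0.5 * K * Math.log(2.0 * Math.PI);
                    for (int j = 0; j < K; j++) {
                        double diff = x[j] - means[m][j];
                        logComp += -0.5 * Math.log(variances[m][j])
                                 - 0.5 * diff * diff / variances[m][j];
                    }
                    p += weights[m] * Math.exp(logComp);   // weighted sum of component densities
                }
                return Math.log(p);
            }
        }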

    3.2.1. Block diagram of GMM based Speaker Recognition System

    Figure 3.11: Block diagram of GMM based Speaker Recognition System (enrollment: speech, feature extraction, model training, model database; verification: speech, feature extraction, matching against the claimed model, accept/reject decision)


    3.2.2. GMM Training

    Given the training speech from a speaker, the goal of speaker model training is to estimate the parameters of the GMM that best match the distribution of the training feature vectors and hence develop a robust model for the speaker. Of the several techniques available for estimating the parameters of a GMM, the most popular is Maximum Likelihood (ML) estimation, carried out with the Expectation-Maximization (EM) algorithm.

    EM is a well-established maximum-likelihood algorithm for fitting a mixture model to a set of training data. It requires an a priori selection of the model order (the number of components M to be incorporated into the model) and an initial estimate of the parameters before iterating through the training.

    The aim of the ML estimation method is to maximize the likelihood of the GMM given the training data. Under the assumption of independent feature vectors, the likelihood of the GMM λ for a sequence of T training vectors X = {x_1, x_2, …, x_T} can be written as

    P(X|λ) = ∏_{t=1}^{T} p(x_t|λ)

    In practice, the above computation is done in the log domain to avoid underflow: instead of multiplying lots of very small probabilities, we add their logarithms. Thus, the log-likelihood of a model λ for a sequence of feature vectors X = {x_1, x_2, …, x_T} is computed as follows:

    log P(X|λ) = (1/T)·Σ_{t=1}^{T} log p(x_t|λ)

    Note that in the above equation, the average log likelihood value is used so as to normalize

    out duration effects from the log-likelihood value. Also, since the (incorrect) independence assumption underestimates the actual likelihood in the presence of dependencies, scaling by T can be considered a rough compensation factor.

    Direct maximization of this likelihood function is not possible because it is a non-linear function of the parameters, so the likelihood is maximized using the Expectation-Maximization algorithm.

    The basic idea of the EM algorithm is, beginning with an initial model λ, to estimate a new model λ̄ such that P(X|λ̄) >= P(X|λ). The new model then becomes the initial


    model for the next iteration, and the process is repeated until some convergence threshold is reached, i.e., until the improvement P(X|λ̄) - P(X|λ) falls below a chosen threshold.


    For initialization, the j-th diagonal element of the covariance of the training data, C_data, can be estimated as the sample variance of the j-th feature:

    σ²_data,j = (1/T)·Σ_{t=1}^{T} ( x_t,j - μ̄_j )²

    where μ̄_j is the mean of the j-th feature over the training data. A measure of the volume that the training data occupies is derived from these variances, and the initial covariance of each mixture is calculated from that volume. A minimum covariance (threshold) value, computed in the same way, is enforced to avoid NaN (Not a Number) errors during the EM iterations. Covariance limiting was done in this way for each mixture. For simplicity, we initialized the covariance values to be the same for all Gaussian components.

    For training the GMM parameters we used the following constants:

    Number of iterations:

    MINIMUM_ITERATION = 100;
    MAXIMUM_ITERATION = 500;

    and minimum log-likelihood change for convergence:

    LOGLIKELIHOOD_CHANGE = 0.000001;
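    The iteration control implied by these constants can be sketched as follows. The EmModel interface is a placeholder for whatever object holds the mixture parameters, performs one E-step plus M-step, and returns the new average log-likelihood; it is assumed here that MINIMUM_ITERATION is the number of iterations that must run before the convergence test is applied.

        public final class GmmTrainer {
            /** Placeholder for the object that performs one EM step and reports the new
             *  average log-likelihood of the training data. */
            public interface EmModel {
                double emStep(double[][] features);
            }

            static final int MINIMUM_ITERATION = 100;
            static final int MAXIMUM_ITERATION = 500;
            static final double LOGLIKELIHOOD_CHANGE = 0.000001;

            /** Iterates EM until the average log-likelihood stops improving. */
            public static void train(EmModel model, double[][] features) {
                double previous = Double.NEGATIVE_INFINITY;
                for (int iter = 1; iter <= MAXIMUM_ITERATION; iter++) {
                    double current = model.emStep(features);
                    if (iter >= MINIMUM_ITERATION
                            && Math.abs(current - previous) < LOGLIKELIHOOD_CHANGE) {
                        break;                                // converged
                    }
                    previous = current;
                }
            }
        }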


    3.2.3. Verification

    After the training stage, we have a complete model (GMM) for each speaker. The speaker verification task is a hypothesis-testing problem: based on the input speech observations, it must be decided whether the claimed identity of the speaker is correct or not.

    So, the hypothesis test can be set as:

    H0: the speaker is the claimed speaker

    H1: the speaker is an imposter

    The decision between these two hypotheses is based on the likelihood ratio

    P(X|λ_hyp) / P(X|λ_imp)

    where P(X|λ_hyp) is the likelihood that the utterance was produced by the claimed speaker model λ_hyp, while P(X|λ_imp) is the likelihood that the utterance was produced by the imposter model λ_imp. Here the imposter model, also called the Universal Background Model (UBM), is obtained by training on a collection of speech samples from a large number of speakers, representative of the population of speakers.

    The likelihood ratio is often expressed in the logarithmic domain as

    Λ(X) = log( P(X|λ_hyp) / P(X|λ_imp) ) = log P(X|λ_hyp) - log P(X|λ_imp)

    The decision is made as follows:

    If Λ(X) < T, reject the null hypothesis, i.e., the speaker is an imposter.
    If Λ(X) >= T, accept the null hypothesis, i.e., the speaker is the claimed one.

    The threshold value T is set in such a way that the error of the system is minimal, so that true claimants are accepted and false claimants are rejected as often as possible.
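    Putting the pieces together, the verification decision reduces to a difference of average log-likelihoods compared against T. The sketch below reuses the frame-level logPdf sketched earlier in this section; the names and the exact scoring granularity are illustrative assumptions.

        public final class Verifier {
            /** Average log-likelihood of an utterance (sequence of feature vectors) under one GMM. */
            public static double avgLogLikelihood(double[][] frames, double[] w,
                                                  double[][] mu, double[][] var) {
                double sum = 0.0;
                for (double[] x : frames) sum += DiagonalGmm.logPdf(x, w, mu, var);
                return sum / frames.length;
            }

            /** Accept the identity claim if the log-likelihood ratio exceeds the threshold T. */
            public static boolean verify(double logPClaimed, double logPUbm, double threshold) {
                return (logPClaimed - logPUbm) >= threshold;
            }
        }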

    3.2.4. Performance measure of Speaker Verification System

    In general, the performance of the speaker verification system is determined by False

    Rejection Rate (FRR) and False Acceptance Rate (FAR).


    1) False Rejection Rate (FRR)

    FRR is a measure of the likelihood that the system will incorrectly reject an access attempt by an authorized user. A system's FRR is typically stated as the ratio of the number of false rejections to the number of verification tests.

    2) False Acceptance Rate (FAR)

    FAR is a measure of the likelihood that the system will incorrectly accept an access attempt by an unauthorized user. A system's FAR is usually stated as the ratio of the number of false acceptances to the number of verification tests.

    Total Error Rate (TER) is the combination of the false rejection and false acceptance rates, and the requirement of the system is to minimize it. Both errors depend on the threshold value used during verification: at lower threshold values FAR is predominant, while at higher threshold values FRR is predominant. This dependency of the two errors can be seen in the figure below. At a certain threshold value the two errors are equal and the TER is minimum.

    Figure 3.12: Equal Error Rate (EER)
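    Given log-likelihood-ratio scores for a set of genuine trials and a set of imposter trials, FRR and FAR at a particular threshold can be computed as in the sketch below (illustrative names). Sweeping the threshold and locating the point where the two rates meet gives the Equal Error Rate.

        public final class ErrorRates {
            /** Returns {FRR, FAR} for one threshold, given genuine and imposter trial scores. */
            public static double[] rates(double[] genuineScores, double[] imposterScores,
                                         double threshold) {
                int falseRejects = 0, falseAccepts = 0;
                for (double s : genuineScores) if (s < threshold) falseRejects++;   // true speaker rejected
                for (double s : imposterScores) if (s >= threshold) falseAccepts++; // imposter accepted
                return new double[] {
                    (double) falseRejects / genuineScores.length,
                    (double) falseAccepts / imposterScores.length
                };
            }
        }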


    3.3 Implementation of HMM for Speech Recognition

    The basic block diagram for isolated word recognition is given below:


    Figure 3.13: Speech Recognition algorithm flow

    In order to do isolated word speech recognition, we must perform the following:

    1) The codebook is generated using the feature vectors of the training data, and vector quantization uses the codebook to map each feature vector to a discrete observation symbol.

    2) For each word v in the vocabulary, an HMM λ_v is built, i.e., we must estimate the model parameters (A, B, π) that optimize the likelihood of the training-set observation vectors for the v-th word. In order to make reliable estimates of all model parameters, multiple observation sequences must be used. The Baum-Welch algorithm is used for estimation of the HMM parameters.

    3) For each unknown word which is to be recognized, several steps must be carried out: measurement of the observation sequence O = {O1, O2, …, OT} via feature analysis of the speech corresponding to the word, followed by calculation of the model likelihoods for all possible models, P(O|λ_v), 1 <= v <= V, followed by selection of the word whose model likelihood is highest (see the sketch after this list):

    v* = argmax_{1 <= v <= V} [ P(O|λ_v) ]

    The probability computation step is performed using the Viterbi algorithm and requires on the order of V·N²·T computations.
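    The selection in step 3 is a simple argmax over the per-word model scores, as sketched below. WordModel is a placeholder for a trained word HMM whose score is obtained with the Viterbi or forward algorithm described in 3.3.2; all names are illustrative.

        public final class WordRecognizer {
            /** Placeholder for a per-word HMM that can score a discrete observation sequence. */
            public interface WordModel {
                String word();
                double logLikelihood(int[] observations);
            }

            /** Picks the vocabulary word whose model gives the highest likelihood. */
            public static String recognize(int[] observations, java.util.List<WordModel> models) {
                String best = null;
                double bestScore = Double.NEGATIVE_INFINITY;
                for (WordModel m : models) {
                    double score = m.logLikelihood(observations);
                    if (score > bestScore) { bestScore = score; best = m.word(); }
                }
                return best;
            }
        }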


    Figure 3.14: Pronunciation model of word TOMATO

    The above figure shows the pronunciation model of the word tomato. The circles represent the states and the numbers above the arrows represent transition probabilities. The pronunciation of the same word may differ from person to person; the figure reflects two pronunciation styles for the word tomato. So, in order to best model each word, we need to train the word on as large a set of speakers as possible, so that the model covers the variation in pronunciation for that word.

    Vector Quantization:

    HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary, or short-time stationary, signal: over a short time window, speech can be approximated as a stationary process. Each acoustic feature vector represents information such as the amount of energy in different frequency bands at a particular point in time. The observation sequence for speech recognition is a sequence of acoustic feature vectors (MFCC vectors), and the phonemes are the hidden states. One way to make MFCC vectors look like symbols that we could count is to build a mapping function that maps each input vector onto one of a small number of symbols. This idea of mapping input vectors to discrete quantized symbols is called vector quantization, or VQ.

    The type of HMM that models speech signals using the VQ technique to produce the observations is called a Discrete Hidden Markov Model (DHMM). However, VQ loses some information from the speech signal even when we increase the number of codewords. This loss is due to the quantization error (distortion), which can be reduced by increasing the number of codewords in the codebook but cannot be eliminated. The long sequence of speech samples is represented by a stream of indices representing frames of different window lengths. Hence, VQ can be considered a process of redundancy removal, which minimizes the number of bits required to identify each frame of the speech


    signal. In vector quantization, we create the small symbol set by mapping each training feature vector into a small number of classes, and then we represent each class by a discrete symbol. More formally, a vector quantization system is characterized by a codebook, a clustering algorithm, and a distance metric.

    A codebook is a list of possible classes, a set of symbols constituting the features F = {f1, f2, ..., fn}. All feature vectors from the training speech data are clustered into 256 classes, thereby generating a codebook with 256 centroids, with the help of the K-means clustering technique. Vector Quantization (VQ) is then used to get the discrete observation sequence from the input feature vectors by applying the distance metric to the codebook.

    Figure 3.15: Vector Quantization

    As shown in the above figure, to make the feature vectors discrete, each incoming feature vector is compared with each of the 256 prototype vectors in the codebook. The closest one (in Euclidean distance) is selected, and the input vector is replaced by the index of the corresponding centroid in the codebook. In this way all continuous input feature vectors are quantized to a discrete set of symbols.
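    A minimal sketch of this quantization step, with the codebook held as a 256-by-39 array of centroids and squared Euclidean distance as the metric (names are illustrative):

        public final class VectorQuantizer {
            /** Maps each feature frame to the index of the nearest codebook centroid. */
            public static int[] quantize(double[][] frames, double[][] codebook) {
                int[] symbols = new int[frames.length];
                for (int t = 0; t < frames.length; t++) {
                    int best = 0;
                    double bestDist = Double.MAX_VALUE;
                    for (int c = 0; c < codebook.length; c++) {
                        double dist = 0.0;
                        for (int k = 0; k < frames[t].length; k++) {
                            double d = frames[t][k] - codebook[c][k];
                            dist += d * d;                          // squared Euclidean distance
                        }
                        if (dist < bestDist) { bestDist = dist; best = c; }
                    }
                    symbols[t] = best;                              // replace vector by centroid index
                }
                return symbols;
            }
        }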


    3.3.1. Isolated Word Recognition

    For isolated word recognition with a distinct HMM designed for each word in the vocabulary, a left-right model is more appropriate than an ergodic model, since we can then associate time with the model states in a fairly straightforward manner. Furthermore, we can envision the physical meaning of the model states as distinct sounds (e.g., phonemes, syllables) of the word being modeled.

    The issue of the number of states to use in each word model leads to two schools of thought. One idea is to let the number of states correspond roughly to the number of sounds (phonemes) within the word; hence models with 2 to 10 states would be appropriate. The other idea is to let the number of states correspond roughly to the average number of observations in a spoken version of the word, so that each state corresponds to one observation interval, i.e., about 15 ms for the analysis we use. The former approach is used in our speech recognition system. Furthermore, we restrict each word model to have the same number of states; this implies that the models will work best when they represent words with the same number of sounds.

    Figure 3.16: Average word error rate (for a digits vocabulary) versus the number of states N in the HMM

    The above figure shows a plot of average word error rate versus N for the case of recognition of

    isolated digits (i.e., a 10-word vocabulary). It can be seen that the error is somewhat

    insensitive to N, achieving a local minimum at N=6; however, differences in error rate for

    values of N close to 6 are small.

    The next issue is the choice of the observation vector and the way it is represented. Since we are representing an entire region of the vector space by a single vector, a distortion penalty is


    associated with VQ. It is advantageous to keep the distortion penalty as small as possible. However, this implies a large codebook, which leads to problems in implementing HMMs with a large number of parameters. Although the distortion steadily decreases as M increases, only small decreases in distortion accrue beyond a value of M = 32. Hence HMMs with codebook sizes from M = 32 to 256 vectors have been used in speech recognition experiments. For the discrete symbol models we have used a codebook with M = 256 codewords to generate the discrete symbols.

    Figure 3.17: Curve showing the tradeoff of VQ average distortion as a function of the size of the VQ, M (shown on a log scale)

    Another main issue is the initialization of the HMM parameters. The parameters that constitute any model are π, A, and B. The values of π are given by π = [1 0 0 0 0 ... 0], because the left-right model of HMM used in our speech recognition system always starts in the first state and ends in the last state. Random values between 0 and 1 are assigned as the initial values of the elements of the A and B parameters.

    3.3.2. Application of HMM

    Given the form of HMM, there are three basic problems of interest that must be solved for the

    model to be useful in real-world applications. These problems are the following:


    3.3.2.1 Evaluation Problem: Computing P(O|λ)

    Given the observation sequence O = O1 O2 … OT and the model λ = (A, B, π), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model?

    Solution:

    The aim of this problem is to find the probability of the observation sequence O = (O1, O2, …, OT) given the model λ, i.e. P(O|λ). Because the observations produced by the states are assumed to be independent of each other and of the time t, the probability of the observation sequence O being generated by a certain state sequence q can be calculated as a product:

    P(O|q, λ) = b_q1(O1)·b_q2(O2)·…·b_qT(OT)

    And the probability of the state sequence q is:

    P(q|λ) = π_q1·a_q1q2·a_q2q3·…·a_q(T-1)qT

    The aim was to find P(O|λ), and this probability of O (given the model λ) is obtained by summing the joint probability over all possible state sequences q, giving:

    P(O|λ) = Σ_q P(O|q, λ)·P(q|λ)

    This direct computation has one major drawback: it is infeasible due to the exponential growth of the number of computations as a function of the sequence length T. To be precise, it needs (2T - 1)·N^T multiplications and N^T - 1 additions. An excellent tool which cuts the computational requirements to linear in T is the well-known forward algorithm. The forward algorithm needs N(N+1)(T-1)+1 multiplications and N(N-1)(T-1) additions.

    Forward Algorithm

    Consider a forward probability variable α_t(i), defined for instant t and state i as:

    α_t(i) = P(O1, O2, O3, …, Ot, q_t = S_i | λ)

    This probability function can be computed for N states and T observations iteratively:

    Step 1: Initialization: α_1(i) = π_i·b_i(O1),   1 <= i <= N


    Figure 3.18: Forward Procedure - Induction Step

    Step 2: Induction

    α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i)·a_ij ]·b_j(O_{t+1}),   1 <= t <= T - 1,  1 <= j <= N

    Step 3: Termination

    P(O|λ) = Σ_{i=1}^{N} α_T(i)

    This stage is just the sum of the values of the probability function α_T(i) over all states at instant T. The sum represents the probability that the given observations were generated by the given model, i.e., how likely the given model is to produce the given observations.
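    A sketch of the forward algorithm for a discrete HMM is given below. It folds in the per-frame scaling discussed in section 3.3.3 and therefore returns log P(O|λ) rather than the raw probability; class and parameter names are illustrative.

        public final class ForwardAlgorithm {
            /** Computes log P(O|lambda) for a discrete HMM (A, B, pi) and observations obs,
             *  using per-frame scaling to avoid underflow. */
            public static double logProbability(double[][] A, double[][] B, double[] pi, int[] obs) {
                int N = A.length, T = obs.length;
                double[] alpha = new double[N];
                double logProb = 0.0;
                for (int i = 0; i < N; i++) alpha[i] = pi[i] * B[i][obs[0]];   // initialization
                logProb += Math.log(scale(alpha));
                for (int t = 1; t < T; t++) {                                   // induction
                    double[] next = new double[N];
                    for (int j = 0; j < N; j++) {
                        double sum = 0.0;
                        for (int i = 0; i < N; i++) sum += alpha[i] * A[i][j];
                        next[j] = sum * B[j][obs[t]];
                    }
                    alpha = next;
                    logProb += Math.log(scale(alpha));
                }
                return logProb;   // termination: the scaling sums accumulate log P(O|lambda)
            }

            /** Normalizes alpha in place and returns its pre-normalization sum. */
            private static double scale(double[] alpha) {
                double sum = 0.0;
                for (double a : alpha) sum += a;
                for (int i = 0; i < alpha.length; i++) alpha[i] /= sum;
                return sum;
            }
        }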

    Backward Algorithm

    This procedure is similar to the forward procedure, but it considers the state flow in the backward direction, from the last observation instant T back to the first instant 1. That means that any state is reached from the states that come just after it in time.

    To formulate this approach, let us consider the backward probability function β_t(i), which can be defined as:

    β_t(i) = P(O_{t+1}, O_{t+2}, …, O_T | q_t = S_i, λ)

    Figure 3.19: Backward Procedures - Induction Step

    In analogy to the forward procedure, we can solve for β_t(i) in the following two steps:

    1 - Initialization: β_T(i) = 1,   1 <= i <= N

    These initial values of β for all states at instant T are arbitrarily selected.

    2 - Induction:

    β_t(i) = Σ_{j=1}^{N} a_ij·b_j(O_{t+1})·β_{t+1}(j),   t = T - 1, T - 2, …, 1,  1 <= i <= N


    3.3.2.2 Decoding Problem: Finding the best path

    Given the observation sequence O = O1 O2 … OT and the model λ = (A, B, π), find the optimal state sequence q = q1 q2 … qT.

    Solution:

    The problem is to find the optimal sequence of states, given the observation sequence and the model. This means that we have to find the optimal state sequence Q = (q1, q2, q3, …, qT-1, qT) associated with the given observation sequence O = (O1, O2, O3, …, OT-1, OT) presented to the model λ = (A, B, π). The criterion of optimality here is to search for the single best state sequence through a modified dynamic-programming technique called the Viterbi algorithm.

    To explain the Viterbi algorithm, the quantity δ_t(i) is defined, which represents the maximum probability along the single best state-sequence path for the given observation sequence after t instants, ending in state i. This quantity can be defined mathematically by:

    δ_t(i) = max_{q1,q2,…,q(t-1)} P( q1, q2, …, q_{t-1}, q_t = S_i, O1, O2, …, Ot | λ )

    The best state sequence is backtracked via another function ψ_t(j). The complete algorithm can be described by the following steps:

    Step 1: Initialization: δ_1(i) = π_i·b_i(O1),  ψ_1(i) = 0,   1 <= i <= N

    Step 2: Recursion:

    δ_t(j) = max_{1 <= i <= N} [ δ_{t-1}(i)·a_ij ]·b_j(Ot),   2 <= t <= T,  1 <= j <= N

    ψ_t(j) = argmax_{1 <= i <= N} [ δ_{t-1}(i)·a_ij ],   2 <= t <= T,  1 <= j <= N

    Step 3: Termination: P* = max_{1 <= i <= N} [ δ_T(i) ],  q*_T = argmax_{1 <= i <= N} [ δ_T(i) ]

    Step 4: Path (state sequence) backtracking:

    q*_t = ψ_{t+1}( q*_{t+1} ),   t = T - 1, T - 2, …, 1


    The Viterbi algorithm can also be used to approximate P(O|λ) by using P* in its place.

    Figure 3.20: Viterbi Search
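    A sketch of the Viterbi search for a discrete HMM is given below; it works entirely in the log domain, so no scaling is needed (names are illustrative).

        public final class ViterbiAlgorithm {
            /** Returns the most likely state sequence for a discrete HMM (A, B, pi). */
            public static int[] bestPath(double[][] A, double[][] B, double[] pi, int[] obs) {
                int N = A.length, T = obs.length;
                double[][] delta = new double[T][N];
                int[][] psi = new int[T][N];
                for (int i = 0; i < N; i++) {
                    delta[0][i] = Math.log(pi[i]) + Math.log(B[i][obs[0]]);   // initialization
                }
                for (int t = 1; t < T; t++) {                                  // recursion
                    for (int j = 0; j < N; j++) {
                        double best = Double.NEGATIVE_INFINITY;
                        int arg = 0;
                        for (int i = 0; i < N; i++) {
                            double v = delta[t - 1][i] + Math.log(A[i][j]);
                            if (v > best) { best = v; arg = i; }
                        }
                        delta[t][j] = best + Math.log(B[j][obs[t]]);
                        psi[t][j] = arg;                                       // remember best predecessor
                    }
                }
                int[] path = new int[T];                                       // termination
                for (int i = 1; i < N; i++) {
                    if (delta[T - 1][i] > delta[T - 1][path[T - 1]]) path[T - 1] = i;
                }
                for (int t = T - 2; t >= 0; t--) path[t] = psi[t + 1][path[t + 1]];  // backtracking
                return path;
            }
        }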

    3.3.2.3 Training Problem: Estimating the Model Parameters

    Given the observation sequence O = O1 O2 … OT, estimate the parameters of the model λ = (A, B, π) that maximize P(O|λ).

    Solution:

    This problem deals with the training issue, which is the most difficult of the three. The task is to adjust the model parameters (A, B, π) according to a certain optimality criterion. The Baum-Welch algorithm (forward-backward algorithm) is one of the well-known techniques used to solve it. It is an iterative method for estimating new values of the model parameters. To explain the training procedure, first the a posteriori probability function γ_t(i) is defined: the probability of being in state i at instant t, given the observation sequence O and the model λ:

    γ_t(i) = P(q_t = S_i | O, λ) = P(O, q_t = S_i | λ) / P(O|λ) = α_t(i)·β_t(i) / Σ_{j=1}^{N} α_t(j)·β_t(j)


    Then another probability function ξ_t(i, j) is defined: the probability of being in state i at instant t and going to state j at instant t+1, given the model and the observation sequence O. ξ_t(i, j) can be defined mathematically as:

    ξ_t(i, j) = P( q_t = S_i, q_{t+1} = S_j | O, λ )

    Figure 3.21: Computation of ξ_t(i, j)

    From the definitions of the forward and backward variables, we can write ξ_t(i, j) in the form

    ξ_t(i, j) = α_t(i)·a_ij·b_j(O_{t+1})·β_{t+1}(j) / P(O|λ)
              = α_t(i)·a_ij·b_j(O_{t+1})·β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i)·a_ij·b_j(O_{t+1})·β_{t+1}(j)

    The relation between γ_t(i) and ξ_t(i, j) can easily be deduced from their definitions:

    γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j)

    Now, if γ_t(i) is summed over all instants (excluding instant T), we get the expected number of transitions made from state S_i, i.e., the number of times this state has been visited and left over all instants.


    On the other hand, if we sum ξ_t(i, j) over all instants (excluding T), we get the expected number of transitions that have been made from state i to state j.

    From the behavior of γ_t(i) and ξ_t(i, j), the following re-estimates of the model parameters can be deduced:

    Initial state distribution (expected number of times in state S_i at time t = 1):

    π̄_i = γ_1(i)

    Transition probabilities (expected number of transitions from state S_i to state S_j divided by the expected number of transitions from state S_i):

    ā_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

    Emission probabilities (expected number of times in state S_j observing symbol v_k divided by the expected number of times in state S_j):

    b̄_j(k) = Σ_{t: Ot = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
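    As an illustration of how γ and ξ drive the re-estimation, the sketch below re-estimates the transition matrix A from unscaled α and β values; the real implementation uses the scaled variables of section 3.3.3, and all names here are illustrative.

        public final class BaumWelch {
            /** One re-estimation of A: a_ij = sum_t xi_t(i,j) / sum_t gamma_t(i),
             *  computed from alpha[t][i], beta[t][i] and the observation sequence obs. */
            public static double[][] reestimateA(double[][] A, double[][] B, int[] obs,
                                                 double[][] alpha, double[][] beta) {
                int N = A.length, T = obs.length;
                double[][] newA = new double[N][N];
                for (int i = 0; i < N; i++) {
                    double gammaSum = 0.0;                    // expected transitions out of state i
                    double[] xiSum = new double[N];           // expected transitions i -> j
                    for (int t = 0; t < T - 1; t++) {
                        double norm = 0.0;                    // denominator: sum over all state pairs
                        for (int a = 0; a < N; a++)
                            for (int b = 0; b < N; b++)
                                norm += alpha[t][a] * A[a][b] * B[b][obs[t + 1]] * beta[t + 1][b];
                        for (int j = 0; j < N; j++) {
                            double xi = alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / norm;
                            xiSum[j] += xi;
                            gammaSum += xi;                   // gamma_t(i) = sum_j xi_t(i,j)
                        }
                    }
                    for (int j = 0; j < N; j++) newA[i][j] = xiSum[j] / gammaSum;
                }
                return newA;
            }
        }

    The emission matrix B and the initial distribution π are re-estimated in the same pass from γ, following the formulas above.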

    3.3.3. Scaling

    α_t(i) consists of the sum of a large number of terms. Since the transition matrix elements (a) and emission matrix elements (b) are less than 1, each term of α_t(i) heads exponentially toward zero as t grows, and for large t the dynamic range of the α_t(i) computation exceeds the precision range of the computer (even in double precision). The remedy is scaling: α_t(i) and β_t(i) are multiplied by a scaling factor that is independent of i (i.e., it depends only on t), with the goal of keeping the scaled α_t(i) within the dynamic range of the computer for 1 <= t <= T. At the end of the computation, the scaling coefficients cancel out exactly. When the Viterbi algorithm is used with logarithms to obtain the maximum-likelihood state sequence, no scaling is required.



    [Class diagram fragment, truncated: interface Algorithm with +execute(); operations +doPreprocessing(), +doPCMNormalization(); fields -capturedSignal, -processedSi…]