Ala’a Spaih, Abeer Abu-Hantash

Directed by Dr. Allam Mousa

Outline for Today

1. Speaker Recognition Field

2. System Overview

3. MFCC & VQ

4. Experimental Results

5. Live Demo

Speaker Recognition Field

Speaker recognition divides into speaker verification and speaker identification.

Each of the two can be text dependent or text independent.

System Overview

Training mode: speech input -> feature extraction -> speaker modeling -> speaker model database

Testing mode: speech input -> feature extraction -> feature matching (against the speaker model database) -> decision logic -> speaker ID

Feature Extraction

Feature extraction is a special form of dimensionality reduction.

The aim is to extract the formants.

Feature Extraction

The extracted features must have specific characteristics:

Easily measurable, and occur naturally and frequently in speech.

Do not change over time.

Vary as much as possible among speakers, but stay consistent for each speaker.

Not affected by speaker health or background noise.

Many algorithms exist to extract them: LPC, LPCC, HFCC, MFCC.

We used the Mel Frequency Cepstral Coefficients (MFCC) algorithm.

Feature Extraction Using MFCC

Input speech -> framing and windowing -> fast Fourier transform -> absolute value -> mel-scaled filter bank -> log -> discrete cosine transform -> feature vectors
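The chain above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the 16 kHz sample rate, the 256-sample frames with 50% overlap, the Hamming window, the naive DFT standing in for an FFT, the small floor inside the logarithm, and skipping coefficient 0 are all assumptions. The 29 filters and 12 coefficients follow the settings reported in the results.

```python
import cmath
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def frame_signal(signal, frame_len, hop):
    """Split the signal into overlapping frames and apply a Hamming window."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    return [[signal[start + n] * window[n] for n in range(frame_len)]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Absolute value of the DFT for bins 0..N/2 (naive O(N^2) DFT, assumed for brevity)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]


def mel_filter_bank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [hz * n_fft / sample_rate for hz in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising edge
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(signal, sample_rate, frame_len=256, hop=128, num_filters=29, num_coeffs=12):
    bank = mel_filter_bank(num_filters, frame_len, sample_rate)
    features = []
    for frame in frame_signal(signal, frame_len, hop):
        spectrum = magnitude_spectrum(frame)
        # Log of each filter output; the floor avoids log(0) (assumption).
        log_energies = [math.log(max(sum(w * s for w, s in zip(row, spectrum)), 1e-10))
                        for row in bank]
        # DCT-II of the log filter-bank outputs; coefficient 0 is skipped (assumption).
        features.append([sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                             for j, e in enumerate(log_energies))
                         for c in range(1, num_coeffs + 1)])
    return features


sample_rate = 16000  # assumed sample rate
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(1024)]
features = mfcc(tone, sample_rate)
print(len(features), len(features[0]))  # 7 frames, 12 coefficients each
```

A practical system would replace the naive DFT with a real FFT routine; the structure of the chain stays the same.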

Framing and Windowing

[Figure: FFT spectrum of a speech frame, with the vocal tract and glottal pulse components labeled.]

Mel Scaled-Filter Bank

mel(f) = 2595 * log10(1 + f/700)

[Figure: the spectrum is passed through the mel-scaled filter bank to give the mel spectrum.]
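To get a feel for the mel formula, the sketch below converts between Hz and mel and lists where filter centres would fall if they were equally spaced on the mel scale. The helper names, the inverse formula (obtained by solving the expression above for f), and the 8000 Hz upper limit are assumptions made for illustration; 29 filters matches the setting quoted in the results.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel, solved for f.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_centres(num_filters, f_max_hz):
    """Centre frequencies (Hz) of filters equally spaced on the mel scale."""
    step = hz_to_mel(f_max_hz) / (num_filters + 1)
    return [mel_to_hz(step * i) for i in range(1, num_filters + 1)]


centres = filter_centres(29, 8000.0)  # 8000 Hz upper limit is an assumption
print([round(c) for c in centres[:3]], [round(c) for c in centres[-3:]])
```

The centres are packed closely at low frequencies and spread out at high frequencies, which is the perceptual warping the mel scale is meant to provide.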

Cepstrum

[Figure: mel spectrum -> MFCC coefficients.]

By taking the DCT of the logarithm of the magnitude spectrum, the glottal pulse and the vocal tract impulse response can be separated.
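Why the logarithm makes this separation possible can be written out under the usual source-filter model of speech. The symbols s[n] (speech), e[n] (glottal excitation) and h[n] (vocal tract impulse response) are introduced here for illustration and do not appear on the slides.

```latex
s[n] = e[n] * h[n]
\;\Longrightarrow\;
|S(f)| = |E(f)|\,|H(f)|
\;\Longrightarrow\;
\log|S(f)| = \log|E(f)| + \log|H(f)|
```

Convolution in time becomes a product of magnitude spectra, and the logarithm turns that product into a sum. Because the DCT is linear, the two terms stay additive in the cepstral domain: the slowly varying envelope log|H(f)| ends up mostly in the low-order coefficients, while the rapidly varying excitation term log|E(f)| ends up mostly in the high-order ones.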

Classification

Classification means building a unique model for each speaker in the database.

There are two major types of models for classification:

Stochastic models: GMM, HMM, ANN

Template models: VQ, DTW

We used the VQ algorithm.

VQ Algorithm

The VQ technique consists of extracting a small number of representative feature vectors.

The first step is to build a speaker database consisting of N codebooks, one for each speaker in the database.

Speaker feature vectors -> clustered into codewords -> speaker model (codebook)

This is done by the K-means clustering algorithm.

K-means Clustering

Start -> choose the number of clusters k -> compute the centroids -> compute the distance of each object to the centroids -> group objects by minimum distance -> no change?

Yes: end. No: go back to computing the centroids.
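The loop in the flowchart can be sketched in a few lines of plain Python. This is an illustrative version, not the code used in the project: the function names are hypothetical, the initial centroids are supplied by the caller, and keeping the old centroid when a cluster becomes empty is an assumption.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def kmeans(points, centroids, max_iter=100):
    """Alternate grouping and centroid updates until no point changes group."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # "no change" -> end
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty (assumption)
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Toy 2-D data with two obvious groups (hypothetical values).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
codebook, labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The returned centroids play the role of the codewords, and the list of them is the speaker's codebook.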

VQ Example

Given the data points, cluster them into 4 codebook vectors, with initial values at (2,2), (4,6), (6,5), (8,8).

VQ Example

Once there’s no more change, the feature space will be partitioned into 4 regions. Any input feature can be classified as belonging to one of the 4 regions. The entire codebook can be specified by the 4 centroid points.

If we set the codebook size to 8 then the output of the clustering will be:

K-means Clustering

[Figure: two scatter plots. Before VQ: MFCCs of a speaker (1000x12). After VQ: speaker codebook (8x12).]

Feature Matching

For each codebook a distortion measure is computed, and the speaker with the lowest distortion is chosen. The distortion measure is defined as the Euclidean distance:

$$ d^2(x, y) = \sum_{i=1}^{D} (x_i - y_i)^2 $$
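A minimal sketch of this matching step follows. It assumes that the distortion of an utterance against a codebook is the squared Euclidean distance from each feature vector to its nearest codeword, averaged over all vectors; the averaging, the function names, and the toy 2-D data are assumptions for illustration only.

```python
def vq_distortion(features, codebook):
    """Mean squared Euclidean distance from each feature vector to its nearest codeword."""
    total = 0.0
    for x in features:
        total += min(sum((xi - yi) ** 2 for xi, yi in zip(x, y)) for y in codebook)
    return total / len(features)


def identify(features, codebooks):
    """Return the speaker whose codebook gives the lowest distortion."""
    return min(codebooks, key=lambda name: vq_distortion(features, codebooks[name]))


# Hypothetical 2-D codebooks; real ones would hold 12-dimensional MFCC codewords.
codebooks = {
    "speaker_a": [(0.0, 0.0), (1.0, 1.0)],
    "speaker_b": [(5.0, 5.0), (6.0, 6.0)],
}
test_features = [(0.9, 1.1), (0.1, -0.1), (1.2, 0.8)]
print(identify(test_features, codebooks))  # speaker_a
```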

System Operates In Two Modes

Offline

Online

Monitoring microphone inputs -> MFCC feature extraction -> calculate VQ distortion -> make decision & display

Applications

Speaker recognition for authentication: banking applications.

Forensic speaker recognition: proving the identity of a recorded voice can help to convict a criminal or discharge an innocent person in court.

Speaker recognition for surveillance: electronic eavesdropping of telephone and radio conversations.

Results

The table shows how the system identifies the speaker according to the Euclidean distance calculation.

        Sp 1      Sp 2      Sp 3      Sp 4      Sp 5
Sp 1    10.7492   13.2712   17.8646   14.7885   13.2859
Sp 2    13.2364   10.2740   13.2884   11.7941   14.0461
Sp 3    17.5438   16.1177   11.9029   16.2916   17.7199
Sp 4    16.1360   13.7095   15.5633   11.7528   16.7327
Sp 5    14.9324   15.7028   17.2842   17.8917   12.3504

12 MFCC, 29 filter banks, codebook size 64 … ELSDSR database.
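The decision rule can be applied directly to the table. The sketch below assumes that each row holds the distances of one test utterance to the five codebooks; the slides do not state which axis is which, so that reading is an assumption.

```python
# Distance values copied from the table above.
distances = [
    [10.7492, 13.2712, 17.8646, 14.7885, 13.2859],
    [13.2364, 10.2740, 13.2884, 11.7941, 14.0461],
    [17.5438, 16.1177, 11.9029, 16.2916, 17.7199],
    [16.1360, 13.7095, 15.5633, 11.7528, 16.7327],
    [14.9324, 15.7028, 17.2842, 17.8917, 12.3504],
]

# For each row, pick the speaker number with the smallest distance.
decisions = [row.index(min(row)) + 1 for row in distances]
print(decisions)  # [1, 2, 3, 4, 5]
```

Every row has its minimum on the diagonal, so all five speakers in this table are matched to their own codebook.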

Results

Number of MFCC vs. ID rate:

No. of MFCC    ID rate
5              76 %
12             91 %
20             91 %

Frame size vs. ID rate:

Frame size     Result
10-30 ms       Good
Above 30 ms    Bad

Results

The effect of the codebook size on the ID rate & VQ distortion.

[Figure: ID rate (%) vs. codebook size, and matching score vs. codebook size, for codebook sizes from 0 to 300.]

Results

Number of filter banks vs. ID rate & VQ distortion.

[Figure: ID rate (%) vs. number of filters in the filter bank, and matching score vs. number of filters in the filter bank, for 0 to 50 filters.]

Results

The performance of the system for different test speech lengths.

Test speech length    ID rate
0.2 sec               60 %
2 sec                 85 %
6 sec                 90 %
10 sec                95 %

[Figure: ID rate (%) vs. test speech length (sec).]

Summary

We studied the effect of changing some parameters of the MFCC algorithm and the VQ algorithm.

Our system identifies the speaker regardless of the language and the text.

Satisfactory results require the same training and testing environment, and test data several tens of seconds long.
