Reporter: Shih-Hsiang (士翔)

Introduction

• A speech signal carries information from many sources
  – Not all information is relevant or important for speech recognition
  – Feature extraction (the first crucial step)

• Acoustic features may greatly affect the performance of a speech recognizer
  – Discriminability
  – Robustness
  – Complexity

• MFCCs are used almost as “standard” acoustic parameters in currently available speech recognition systems
  – They do not cope well with noisy speech
  – Compensation techniques: Wiener filtering, spectral subtraction, RASTA, PMC, MLLR, etc.

• In this paper, the authors present the differential power spectrum (DPS) for speech recognition

Definition of the differential power spectrum

y(t) : received speech signal
s(t) : original clean speech signal
h(t) : impulse response of the transmission channel
x(t) : the noise-free speech signal
v(t) : ambient noise

(0≤n<N, where N is the frame length)

assume

power spectrum

ω : radian frequency
r_y(τ) : the short-time autocorrelation
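A plausible reading of the symbols defined above, assuming the usual convolutive-channel, additive-noise model and a power spectrum taken as the Fourier transform of the short-time autocorrelation. The summation limits are an assumption of this sketch, not taken from the slides:

```latex
y(t) = s(t) * h(t) + v(t) = x(t) + v(t)

Y(\omega) = \sum_{\tau=-(N-1)}^{N-1} r_y(\tau)\, e^{-j\omega\tau}
```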

Definition of the differential power spectrum (cont.)

assume noise and speech signal are mutually uncorrelated

Differential power spectrum (DPS)

(continuous frequency domain)
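A sketch of the step this slide relies on, assuming X(ω) and V(ω) denote the power spectra of x(t) and v(t) (notation introduced here for illustration). With uncorrelated speech and noise the power spectra add, and differentiation with respect to frequency is linear, so the DPS of the noisy signal splits into a speech part and a noise part:

```latex
Y(\omega) = X(\omega) + V(\omega)

D_Y(\omega) = \frac{dY(\omega)}{d\omega}
            = \frac{dX(\omega)}{d\omega} + \frac{dV(\omega)}{d\omega}
            = D_X(\omega) + D_V(\omega)
```

Where the noise power spectrum varies slowly with frequency, its derivative is small, which is the intuition behind the robustness claim.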


Its discrete counterpart can be approximated by the following difference equation

where P and O are the orders of the difference equation, the b_l are real-valued weighting coefficients, and 0 ≤ k < K, where K is the FFT length
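One general form consistent with the description of P, O and the b_l, and with the simple form D(k) = Y(k) – Y(k+1), is sketched here. The summation limits and index convention are assumptions:

```latex
D(k) = \sum_{l=-O}^{P} b_l \, Y(k+l), \qquad 0 \le k < K

% special case O = 0, P = 1, b_0 = 1, b_1 = -1:
D(k) = Y(k) - Y(k+1)
```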


D(k) = Y(k) – Y(k+1)

Representing DPS into speech features

• Three problems
  – The selection of proper orders of the difference equations
  – The determination of the weights b_l
  – How DPS should be converted into a few parameters

• An optimal solution to any of the three listed problems is difficult to achieve

• For the first two problems, the authors propose three special forms
  – DPS1: D(k) = Y(k) – Y(k+1)
  – DPS2: D(k) = Y(k) – Y(k+2)
  – DPS3: D(k) = Y(k-2) + Y(k-1) – Y(k+1) – Y(k+2)

• The third problem is converting DPS into cepstral coefficients
  – An absolute operation makes the negative parts positive
  – The magnitude of DPS is passed through a mel-frequency filter bank
  – Logarithmic filter bank outputs are compressed into a feature vector
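The three special forms can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `dps` is hypothetical, and treating bins outside the spectrum as zero is an assumption made here to keep the output the same length as the input.

```python
import numpy as np

def dps(Y, form=1):
    """Differential power spectrum of a 1-D power spectrum Y.

    Bins that fall outside Y are treated as zero (an assumption of this
    sketch), so values near the edges are boundary artifacts.
    """
    Y = np.asarray(Y, dtype=float)
    Yp = np.pad(Y, 2)                 # Yp[k + 2] == Y[k], two zeros on each side
    k = np.arange(len(Y)) + 2
    if form == 1:                     # DPS1: D(k) = Y(k) - Y(k+1)
        return Yp[k] - Yp[k + 1]
    if form == 2:                     # DPS2: D(k) = Y(k) - Y(k+2)
        return Yp[k] - Yp[k + 2]
    if form == 3:                     # DPS3: D(k) = Y(k-2) + Y(k-1) - Y(k+1) - Y(k+2)
        return Yp[k - 2] + Yp[k - 1] - Yp[k + 1] - Yp[k + 2]
    raise ValueError("form must be 1, 2 or 3")
```

In practice the edge bins would likely be dropped rather than zero-padded; the padding only keeps the example short.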
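The conversion steps listed above can be sketched as one function. Several details are assumptions not given on the slide: DPS1 is used, the final compression is a DCT-II as in conventional MFCC extraction, and the sample rate, number of filters and number of coefficients are placeholder values. The helper names are hypothetical.

```python
import numpy as np

def mel_filterbank(n_filters, bin_hz, max_hz):
    """Triangular filters spaced evenly on the mel scale, shape (n_filters, n_bins)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(max_hz), n_filters + 2))
    fb = np.zeros((n_filters, len(bin_hz)))
    for m in range(n_filters):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def dct_ii(x, n_out):
    """First n_out coefficients of an unnormalized DCT-II of vector x."""
    n = len(x)
    i = np.arange(n_out)[:, None]
    j = np.arange(n)[None, :]
    return np.cos(np.pi * i * (j + 0.5) / n) @ x

def dpscc(frame, sample_rate=8000, n_filters=23, n_ceps=13):
    """DPS-based cepstral coefficients for one windowed frame (sketch)."""
    n_fft = len(frame)
    Y = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum
    D = Y[:-1] - Y[1:]                             # DPS1: D(k) = Y(k) - Y(k+1)
    mag = np.abs(D)                                # absolute operation
    bin_hz = np.arange(len(mag)) * sample_rate / n_fft
    fb = mel_filterbank(n_filters, bin_hz, sample_rate / 2.0)
    log_energies = np.log(fb @ mag + 1e-10)        # small floor avoids log(0)
    return dct_ii(log_energies, n_ceps)            # compress to a feature vector
```

Compared with an MFCC front end, the only additions are the differencing and absolute-value lines, which matches the small extra cost noted in the conclusion.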

Representing DPS into speech features (cont.)

Comparison with the cepstral liftering technique

• If $x_i$ is the i-th cepstral coefficient, then the corresponding liftered cepstral coefficient is given by

$y_i = w_i x_i$, where $w_i$ defines the lifter

• Various types of lifters are proposed in the literature

Linear lifter: $w_i = i$

Statistical lifter: $w_i = 1 / \hat{\sigma}_i$

Sinusoidal lifter: $w_i = 1 + \frac{D}{2} \sin\left(\frac{\pi i}{D}\right)$

Exponential lifter: $w_i = i^{s} \exp\left(-\frac{i^2}{2 r^2}\right)$
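The four lifters can be written as weight vectors, as in the sketch below. The formulas are the commonly cited definitions of the linear, statistical, sinusoidal and exponential lifters and are assumed to be the ones intended on this slide. The function names, the choice to start the cepstral index at i = 1, and any values passed for D, s and r are illustrative assumptions.

```python
import numpy as np

def linear_lifter(n):
    """w_i = i for i = 1..n."""
    return np.arange(1, n + 1).astype(float)

def statistical_lifter(train_ceps):
    """w_i = 1 / sigma_i, with sigma_i estimated from training cepstra (frames x n)."""
    return 1.0 / np.std(train_ceps, axis=0)

def sinusoidal_lifter(n, D):
    """w_i = 1 + (D / 2) * sin(pi * i / D)."""
    i = np.arange(1, n + 1)
    return 1.0 + (D / 2.0) * np.sin(np.pi * i / D)

def exponential_lifter(n, s, r):
    """w_i = i**s * exp(-i**2 / (2 * r**2))."""
    i = np.arange(1, n + 1).astype(float)
    return i ** s * np.exp(-(i ** 2) / (2.0 * r ** 2))

def apply_lifter(ceps, w):
    """Liftered coefficients y_i = w_i * x_i, applied to every frame."""
    return np.asarray(ceps, dtype=float) * w
```

Every lifter is a fixed per-coefficient weight, so liftering amounts to multiplying the cepstral vector by a diagonal matrix.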

Comparison with the cepstral liftering technique (cont.)

Effect of cepstral liftering on the performance of a DTW-based speech recognizer

Type of lifter    SNR in dB
                  ∞       30      25      20      15
No Lifter         93.0    70.6    55.9    37.2    24.0
Lin. Lifter       94.0    90.8    86.6    80.1    70.6
Stat. Lifter      93.9    86.7    78.3    68.3    55.3
Sin. Lifter       94.5    85.9    78.9    68.7    51.5
Exp. Lifter       94.3    90.1    85.1    78.9    68.1


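• But liftering has no effect on a recognition process that uses the Mahalanobis distance, as an HMM does

Mahalanobis distance (HMM):

$d(x; \hat{x}, \hat{\Sigma}_x) = (x - \hat{x})^t \, \hat{\Sigma}_x^{-1} \, (x - \hat{x})$

Mahalanobis distance when liftered cepstral coefficients are used:

$d(y; \hat{y}, \hat{\Sigma}_y) = (y - \hat{y})^t \, \hat{\Sigma}_y^{-1} \, (y - \hat{y})$

With the weighting matrix $W$:

$y = W x, \quad \hat{y} = W \hat{x}, \quad \hat{\Sigma}_y = W \hat{\Sigma}_x W^t$

so that

$d(x; \hat{x}, \hat{\Sigma}_x) = d(y; \hat{y}, \hat{\Sigma}_y)$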
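The invariance of the Mahalanobis distance under liftering can be checked numerically. The sketch below uses random vectors, a random positive-definite covariance and the linear lifter as an example weighting matrix; all of these values are illustrative and none come from the paper.

```python
import numpy as np

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)^t cov^{-1} (x - mean)."""
    diff = x - mean
    return float(diff @ np.linalg.solve(cov, diff))

rng = np.random.default_rng(0)
n = 12
x = rng.normal(size=n)                 # observed cepstral vector
x_hat = rng.normal(size=n)             # model mean
A = rng.normal(size=(n, n))
cov_x = A @ A.T + n * np.eye(n)        # symmetric positive-definite covariance

W = np.diag(np.arange(1.0, n + 1))     # weighting matrix of the linear lifter, w_i = i
y, y_hat = W @ x, W @ x_hat            # liftered vector and liftered mean
cov_y = W @ cov_x @ W.T                # covariance of the liftered coefficients

d_x = mahalanobis_sq(x, x_hat, cov_x)
d_y = mahalanobis_sq(y, y_hat, cov_y)
print(np.isclose(d_x, d_y))
```

The same cancellation holds for any invertible W, which is why a lifter that helps a DTW recognizer using a Euclidean distance gives no gain here.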


• In DPS based cepstrum

Comparison with the spectral subtraction

• SS can be formulated as

• For speech recognition, it was found that SS operated in each band-pass filter could yield more consistent improvement for MFCC features against noise

α : controls the amount of noise subtracted from the noisy signal
β : spectral flooring

E_Y(k) is the output of the k-th band-pass filter when Y(k) is passed through the filter
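A minimal sketch of band-wise spectral subtraction using the two parameters above. It assumes the widely used over-subtraction-with-floor form, in which the floor is taken relative to the noisy filter output; the exact formula used in the paper may differ, and the function name and `E_V` (a noise estimate per filter) are hypothetical.

```python
import numpy as np

def spectral_subtraction(E_Y, E_V, alpha=1.0, beta=0.01):
    """Subtract alpha times the noise estimate, floored at beta times the noisy output.

    E_Y: noisy filter-bank outputs; E_V: estimated noise in the same bands.
    """
    E_Y = np.asarray(E_Y, dtype=float)
    E_V = np.asarray(E_V, dtype=float)
    return np.maximum(E_Y - alpha * E_V, beta * E_Y)
```

For example, `spectral_subtraction([10, 2, 5], [1, 3, 5], alpha=2.0, beta=0.1)` keeps the first band at 8.0 and floors the other two at 0.2 and 0.5. Unlike DPS, this needs an explicit noise estimate.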

Experiments

• In this paper, the authors conduct a number of speech recognition experiments
  – Isolated speech recognition
  – SNR improvement
  – Connected digits recognition
  – Phone recognition
  – Evaluation on AURORA task

Experiments - Isolated speech recognition

• TI46 database – an isolated spoken words database (TI)
  – 16 speakers (8 males / 8 females)
  – Vocabulary consists of
    • 10 isolated digits from ‘ZERO’ to ‘NINE’
    • 26 isolated English letters from ‘A’ to ‘Z’
    • 10 isolated words including “ENTER, ERASE, GO, HELP, NO, RUBOUT, REPEAT, STOP, START, YES”
  – 26 utterances of each word from each speaker (10 training / 16 testing)

• In this experiment, four sets of features are considered
  – MFCC
  – DPSCC1
  – DPSCC2
  – DPSCC3

Experiments - Isolated speech recognition (cont.)

• The DPS based features yield performance at least comparable to that of the standard MFCCs

• For both MFCCs and DPSCCs, the inclusion of dynamic and acceleration features greatly improves the performance
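Dynamic and acceleration features are usually obtained by a regression over neighbouring frames. The sketch below assumes the standard regression formula with a window of ±2 frames and edge padding; the slides do not state which formula or window the authors used, and the function names are hypothetical.

```python
import numpy as np

def deltas(feats, window=2):
    """Regression-based time derivative of a (frames x dims) feature matrix."""
    feats = np.asarray(feats, dtype=float)
    T = len(feats)
    # Repeat the first and last frames so every frame has a full window.
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    denom = 2.0 * sum(t * t for t in range(1, window + 1))
    out = np.zeros_like(feats)
    for t in range(1, window + 1):
        ahead = padded[window + t : window + t + T]
        behind = padded[window - t : window - t + T]
        out += t * (ahead - behind)
    return out / denom

def add_dynamic_features(static):
    """Stack static, dynamic (delta) and acceleration (delta-delta) features."""
    d = deltas(static)
    dd = deltas(d)
    return np.hstack([static, d, dd])
```

With 13 static coefficients per frame this produces a 39-dimensional vector.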

Experiments - SNR improvement

• Clean speech signals are taken from the TI46 database

• Lynx noise is taken from the NOISEX database

• Power spectrum based

• DPS based

Experiments - SNR improvement (cont.)

The average SNR_D is approximately 4 dB higher than SNR_Y
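The effect can be illustrated with a toy example. The SNR definition used here (ratio of summed squared spectral values, in dB) and both spectra are assumptions made for illustration; they are not the paper's definitions, and no TI46 or NOISEX data is involved.

```python
import numpy as np

def snr_db(signal_spec, noise_spec):
    """10*log10 of the energy ratio between two spectra (definition assumed here)."""
    num = np.sum(np.square(signal_spec))
    den = np.sum(np.square(noise_spec))
    return 10.0 * np.log10(num / den)

def dps1(Y):
    """DPS1 without boundary bins: D(k) = Y(k) - Y(k+1)."""
    Y = np.asarray(Y, dtype=float)
    return Y[:-1] - Y[1:]

k = np.arange(64)
X = 1.0 + np.cos(2.0 * np.pi * k / 8.0)   # toy "voiced" spectrum with harmonic peaks
V = 1.0 + 0.01 * k                         # toy noise spectrum, smooth across frequency

snr_y = snr_db(X, V)                       # power-spectrum domain
snr_d = snr_db(dps1(X), dps1(V))           # DPS domain
print(snr_d > snr_y)
```

Differencing keeps the rapidly varying harmonic structure but nearly cancels a slowly varying noise spectrum, so the ratio improves. The size of the gain here is a property of the toy spectra and should not be compared with the 4 dB figure.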

Experiments - Connected digits recognition

• TI connected digits database – contains digit strings uttered by adult and child speakers
  – Vocabulary consists of
    • 11 words - 10 digits and an “oh”
  – Each speaker uttered 77 sequences of these words

• Noise is added to the speech signals in the test set, while the training speech is kept clean
  – wide-band stationary speech noise, machine-gun noise, Lynx noise

• Four sets of feature vectors are investigated
  – MFCC
  – DPSCC
  – MFCC + CMN
  – DPSCC + CMN
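CMN (cepstral mean normalization) subtracts the mean of each cepstral coefficient over time. A minimal sketch follows; whether the authors normalize per utterance is not stated on the slide, so per-utterance normalization is an assumption, and the function name is hypothetical.

```python
import numpy as np

def cmn(ceps):
    """Subtract the per-coefficient mean over all frames of one utterance."""
    ceps = np.asarray(ceps, dtype=float)
    return ceps - ceps.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
clean = rng.normal(size=(50, 13))      # placeholder cepstra, frames x coefficients
shifted = clean + 3.0                  # constant cepstral offset, e.g. a fixed channel
print(np.allclose(cmn(clean), cmn(shifted)))
```

A fixed channel response appears as a constant offset in the cepstral domain, so removing the mean cancels it.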

Experiments - Connected digits recognition (cont.)

• Compared with MFCC, DPSCC yields at least comparable performance in clean conditions

• In most strong noise conditions, DPSCC outperforms MFCC

• CMN effectively improves the robustness of both feature sets

Experiments - Phone recognition

• TIMIT phoneme based continuous speech database
  – Contains a total of 6300 sentences
  – 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the US
  – Phonetic recognition is performed on the database over the set of 39 classes that are commonly used for evaluation

• Noise is added to the speech signals in the test set, while the training speech is kept clean
  – wide-band stationary speech noise, machine-gun noise, Lynx noise

• Two feature sets are used
  – MFCC+CMN (39 coefficients)
  – DPSCC+CMN (39 coefficients)

Experiments - Phone recognition (cont.)

• The MFCC and the DPSCC features yield comparable results in clean and weak noise conditions

• DPSCC features slightly outperform the MFCC features in strong noise conditions

Experiments - Evaluation on AURORA task

• Noise signals are recorded at different places
  – suburban train, babble, car, exhibition hall, restaurant, street, airport and train station

• Two training modes are defined
  – Training on clean data only
    • 8440 utterances (55 male / 55 female speakers)
    • Signals are filtered with the G.712 characteristic without noise added
  – Training on clean as well as noisy data (multi-condition)
    • 8440 utterances, split into 20 subsets of 422 utterances each
    • Suburban train, babble, car, and exhibition hall noises are added to the 20 subsets at 5 different SNRs (20, 15, 10, 5 dB and the clean condition)

• Three test sets are defined
  – Test Set A, Test Set B, Test Set C

Experiments - Evaluation on AURORA task (cont.)

• With the use of CMN, the average word error rate is reduced by 8.8%

• When SS is used together with CMN, the average performance improves by 19.3%

• The DPS based cepstrum outperforms MFCC. It also yields a slightly better performance than SS

Discussion and conclusion

• DPS can also preserve spectral information to discriminate among different linguistic units (e.g. phonemes and words)

• DPS has a higher SNR than the power spectrum, especially for voiced frames
  – DPS based features should be more resilient to noise than power spectrum based features

• The DPSCC yields at least comparable performance when compared to the conventional MFCCs
  – In most cases, it outperforms MFCC

• Compared to the estimation of MFCC, the extraction of DPSCC requires (K/2-1) more addition (subtraction) and absolute operations for each frame
  – This increase in computational complexity is negligible for today’s computers
