PCS Research & Advanced Technology Labs
Speech Lab
How to deal with noise in real systems?
Hsiao-Chun Wu
Motorola PCS Research and Advanced Technology Labs, Speech Laboratory
Phone: (815) 884-3071
PCS Research & Advanced Technology Labs
Speech Lab November 14, 2000
Why do we need to study noise?
Noise exists everywhere, and it degrades the performance of real-world signal
processing. Since noise cannot be avoided, system engineers have researched
and designed modern “noise-processing” technology to overcome this problem.
Many related research areas have emerged as a result, such as signal detection,
signal enhancement/noise suppression, and channel equalization.
How to deal with noise? Cut it off!!!!

• Spectral Truncation – Spectral Subtraction (1989):
  Given the received spectrum R(f) = S(f) + N(f), estimate the noise spectrum Ñ(f) and form
  S̃(f) = R(f) − Ñ(f) ≈ S(f).
• Time Truncation – Signal Detection:
  Keep s̃(t) = r(t) only for t inside the detected speech interval T; outside T the
  received signal r(t) is noise alone and is discarded.
• Spatial and/or Temporal Filtering – Equalization:
  s̃(t) = w(t) * r(t) = w(t) * h(t) * s(t) ≈ s(t), where r(t) = h(t) * s(t).
  – Array Signal Separation (Blind Source Separation):
  S̃(t) = W(t) R(t) = W(t) H(t) S(t) ≈ S(t), where R(t) = H(t) S(t).
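As an illustration of the spectral-truncation idea, here is a minimal Python sketch of magnitude spectral subtraction. The half-wave rectification of negative magnitudes and the oracle noise estimate are simplifying assumptions for the demo, not details from the slides:

```python
import numpy as np

def spectral_subtraction(r, noise_mag, n_fft=256):
    """S~(f) = R(f) - N~(f): subtract an estimated noise magnitude spectrum
    from the received spectrum, keep the noisy phase, and resynthesize."""
    R = np.fft.rfft(r, n_fft)
    mag = np.maximum(np.abs(R) - noise_mag, 0.0)   # half-wave rectify negatives
    return np.fft.irfft(mag * np.exp(1j * np.angle(R)), n_fft)

# toy example: a sinusoid buried in white noise
rng = np.random.default_rng(0)
t = np.arange(256)
clean = np.sin(2.0 * np.pi * 0.05 * t)
noise = 0.5 * rng.standard_normal(256)
noise_mag = np.abs(np.fft.rfft(noise, 256))        # oracle noise spectrum
enhanced = spectral_subtraction(clean + noise, noise_mag)
```

In practice the noise spectrum must itself be estimated, typically from frames classified as noise-only.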
Session 1. On-line Automatic End-of-speech Detection Algorithm (Time Truncation)
1. Project goal.
2. Review of current methods.
3. Introduction to voice metric based end-of-speech detector.
4. Simulation results.
5. Conclusion.
1. Project Goal:
• Problem
– Digit-dial recognition with unknown digit string length
• Solution 1
– A fixed-length capture window, e.g., 10 seconds? (inconvenient for users)
• Solution 2
– Dynamic termination of data capture? (requires a robust detection
algorithm)
• Research and design a robust dynamic-termination mechanism for the speech
recognizer:
– a new on-line automatic end-of-speech detection algorithm with low
computational complexity.
• Design a more robust front end to improve the recognition accuracy of
speech recognizers:
– the new algorithm also avoids excessive feature extraction on redundant
noise.
2. Review of Current Methods:
Most speech detection algorithms fall into three categories.
• Frame energy detection – short-term frame energy (20 msec) can be used for speech/noise
classification.
– It is not robust at high background-noise levels.
• Zero-crossing-rate detection – the short-term zero-crossing rate can also be used for speech/noise
classification.
– It is not robust across a wide variety of noise types.
• Higher-order-spectral detection – short-term higher-order spectra can be used for speech/noise
classification.
– It incurs heavy computational complexity, and its threshold is difficult to predetermine.
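For concreteness, the first two detectors can be sketched in a few lines of Python (the 20 ms frame length and 8 kHz sampling rate are illustrative choices):

```python
import numpy as np

def frame_energy(frame):
    """Short-term energy of one analysis frame."""
    return float(np.sum(np.asarray(frame, dtype=float) ** 2))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    return float(np.mean(signs[1:] != signs[:-1]))

# one 20 ms frame at 8 kHz = 160 samples
t = np.arange(160) / 8000.0
voiced = np.sin(2.0 * np.pi * 120.0 * t)       # low-frequency tone: high energy
rng = np.random.default_rng(1)
noise = 0.05 * rng.standard_normal(160)        # weak wideband noise
print(frame_energy(voiced) > frame_energy(noise))              # True
print(zero_crossing_rate(noise) > zero_crossing_rate(voiced))  # True
```

Both statistics separate this easy example cleanly; the weaknesses listed above appear when the noise is loud (energy) or speech-like (zero-crossing rate).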
3. Introduction to the Voice-Metric-Based End-of-speech Detector:
• End-of-speech detection using voice-metric features is based on the Mel-
energies. Voice-metric features are robust over a wide variety of background
noise. The voice-metric-based speech/noise classifier was originally used in the
IS-127 CELP speech coder standard. We modify and enhance the voice-metric
features to design a new end-of-speech detector for the Motorola voice
recognition front end (VR LITE III).
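The actual IS-127 score table is not reproduced in these slides; the sketch below only illustrates the general mechanism — per-Mel-band SNR estimates looked up in a score table and summed — with made-up thresholds and scores:

```python
import numpy as np

# Illustrative (made-up) score table: higher band SNR -> larger score
SNR_EDGES = np.array([0.0, 3.0, 6.0, 9.0, 12.0, 15.0])   # dB bin edges
SCORES = np.array([0, 1, 2, 4, 8, 16, 32])               # one score per bin

def voice_metric(mel_energies, noise_energies):
    """Sum the per-Mel-band scores looked up from each band's SNR estimate."""
    snr_db = 10.0 * np.log10(np.maximum(mel_energies, 1e-12) /
                             np.maximum(noise_energies, 1e-12))
    return int(np.sum(SCORES[np.searchsorted(SNR_EDGES, snr_db)]))

noise_bands = np.ones(16)            # flat noise-floor Mel energies
speech_bands = 20.0 * np.ones(16)    # ~13 dB above the noise floor
print(voice_metric(speech_bands, noise_bands))   # 256: strongly speech-like
print(voice_metric(noise_bands, noise_bands))    # 0: noise-like
```

A speech/noise decision then compares the summed score against an (adaptive) threshold.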
[Figure: voice metric score table]
[Figure: block diagram of the end-of-speech detector attached to the original VR LITE front end — raw data → FFT → Mel-spectrum → voice metric scores and SNR estimate → pre- and post-S/N classifiers with threshold adaptation → EOS buffer; once speech has started, data capture stops when the silence duration exceeds the threshold]
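The termination logic at the bottom of the diagram — stop capture once enough consecutive silence frames have followed the start of speech — can be sketched as below. The per-frame speech/noise flags and the frame-count threshold are illustrative inputs; in the real detector they come from the voice-metric classifier and the 1.85-second silence threshold:

```python
def end_of_speech(frame_is_speech, silence_threshold_frames):
    """Return the frame index at which data capture stops: after speech has
    started, terminate once the run of silence frames reaches the threshold."""
    speech_started = False
    silence_run = 0
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            speech_started = True
            silence_run = 0
        elif speech_started:
            silence_run += 1
            if silence_run >= silence_threshold_frames:
                return i
    return len(frame_is_speech)      # window exhausted without termination

# speech with a short pause, then trailing silence; threshold = 3 frames
flags = [False, True, True, False, True, False, False, False, False]
print(end_of_speech(flags, 3))       # 7
```

Note that the short mid-utterance pause resets the silence counter, so only the trailing silence triggers termination.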
[Figure: integration flowchart — speech input → front end with end-of-speech detector → segmentation of speech into frames → feature vector frame buffer → VR LITE recognition engine; after each frame i the detector asks “end of speech?” — yes: data capture terminates; no: proceed to frame i+1]
[Figure: raw data, end point, and detected end point for the string “2-2-9-1-7-8” in a car at 55 mph; marked times: 6.51 seconds, 3.78 seconds, 4.81 seconds]
[Figure: correct-detection and false-detection time errors relative to the end point for the string “2-2-9-1-7-8” in a car at 55 mph; horizontal axis in seconds]
4. Simulation Results: (Simulations were run over the Motorola digit-string database: 16 speakers and 15,166 variable-length digit strings under 7 different conditions. The silence threshold is 1.85 seconds.)
A. Receiver Operating Characteristic (ROC) curve: the ROC curve
plots the end-of-speech detection rate against the false (early)
detection rate. We compare two methods:
(1) the new voice-metric-based end-of-speech detector and
(2) the old speech/noise-flag-based end-of-speech detector.
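Under one plausible reading of these definitions — each utterance ends in a correct detection, a false (early) detection, or a miss — a single ROC operating point could be computed as follows (the outcome labels are an assumption, not from the slides):

```python
def roc_point(outcomes):
    """Detection rate and false (early) detection rate, both in percent.
    outcomes: 'correct', 'false', or 'missed', one label per utterance."""
    n = len(outcomes)
    detection_rate = 100.0 * sum(o != 'missed' for o in outcomes) / n
    false_rate = 100.0 * sum(o == 'false' for o in outcomes) / n
    return detection_rate, false_rate

print(roc_point(['correct', 'correct', 'false', 'missed']))   # (75.0, 25.0)
```

Sweeping the detector's silence threshold trades one rate against the other and traces out the curve.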
[Figure: ROC curves — detection rate (%) vs. false detection rate (%) for the two detectors]
• B. String-accuracy-convergence (SAC) curve: SAC
curve is the relationship between the string recognition accuracy
versus the false (early) detection rate. We compare two different
methods, namely, (1) new voice-metric based end-of-speech
detector and (2) old speech/noise flag based end-of-speech
detector.
[Figure: SAC curves — string recognition accuracy (%) vs. false detection rate (%) for the two detectors]
C. Table of detection results: (This table shows results over the Madison sub-database, which includes data files with 1.85 seconds or more of silence after the end of speech.)
| Condition | Average Time Error | Average False Detection Time Error | Average Correct Detection Time Error | False Detection Rate | Number of Strings | Total Detection Rate |
|---|---|---|---|---|---|---|
| Overall | 1.98 sec | 1.68 sec | 1.85 sec | 0.47% | 7,418 | 86.08% |
| Office Close-talk | 1.97 sec | 0 sec | 1.93 sec | 0% | 907 | 94.82% |
| Office Arm-length | 1.98 sec | 0 sec | 1.93 sec | 0% | 988 | 93.62% |
| Café Close-talk | 2.17 sec | 0 sec | 2.00 sec | 0% | 1,147 | 81.87% |
| Café Arm-length | 2.31 sec | 0.14 sec | 2.00 sec | 0.11% | 898 | 57.57% |
| Car Idle (HF) | 1.91 sec | 1.02 sec | 1.84 sec | 0.08% | 1,210 | 93.97% |
| Car 35 mph (HF) | 1.93 sec | 0.96 sec | 1.77 sec | 0.71% | 1,130 | 87.61% |
| Car 55 mph (HF) | 1.66 sec | 2.00 sec | 1.59 sec | 2.20% | 1,138 | 89.63% |
(This table shows results over the small database collected by Motorola PCS CSSRL. All digit strings were recorded in a fixed 15-second window.)

| Condition | Average Time Error | Average False Detection Time Error | Average Correct Detection Time Error | False Detection Rate | Number of Strings | Total Detection Rate | String Recognition Accuracy (w/ EOS) | String Recognition Accuracy (w/o EOS) |
|---|---|---|---|---|---|---|---|---|
| Overall | 1.82 sec | 0 sec | 1.82 sec | 0% | 121 | 96.69% | 50.41% | 29.75% |
| Office Close-talk | 1.85 sec | 0 sec | 1.85 sec | 0% | 21 | 100% | 66.67% | 61.90% |
| Office Arm-length | 1.84 sec | 0 sec | 1.84 sec | 0% | 20 | 100% | 65.00% | 65.00% |
| Café Close-talk | 1.76 sec | 0 sec | 1.76 sec | 0% | 40 | 100% | 40.00% | 15.00% |
| Café Arm-length | 1.85 sec | 0 sec | 1.85 sec | 0% | 40 | 90% | 45.00% | 10.00% |
Analysis of the Simulation Results: why didn’t EOS detection work well in babble noise?
Optimal Detection Decision
• Bayes classifier
• Likelihood Ratio Test
Decide between the hypotheses by comparing the log-likelihoods:

log[ f(x | s+n) ] ≷ log[ f(x | n) ]   (H_s if the left side is larger, H_n otherwise)

Equivalently, in terms of the log-likelihood ratio,

L(x) = log[ f(x | s+n) ] − log[ f(x | n) ] ≷ T,   with T = T_Bayes = log[ P(n) / P(s+n) ].
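A minimal numerical example of the likelihood-ratio test, assuming zero-mean Gaussian frame models in which speech simply adds power to the noise — a simplification standing in for whatever densities a real detector would fit:

```python
import numpy as np

def log_likelihood_ratio(x, sigma_n, sigma_sn):
    """L(x) = log f(x | s+n) - log f(x | n) for zero-mean Gaussian frame
    models that differ only in variance (speech adds power to the noise)."""
    def loglik(x, var):
        return -0.5 * np.sum(np.log(2.0 * np.pi * var) + x ** 2 / var)
    return loglik(x, sigma_sn ** 2) - loglik(x, sigma_n ** 2)

rng = np.random.default_rng(2)
noise_frame = 1.0 * rng.standard_normal(160)    # noise-only frame
speech_frame = 3.0 * rng.standard_normal(160)   # speech+noise frame (more power)
T = 0.0   # threshold; the Bayes choice would be log[P(n)/P(s+n)]
print(log_likelihood_ratio(speech_frame, 1.0, 3.0) > T)   # True  -> decide H_s
print(log_likelihood_ratio(noise_frame, 1.0, 3.0) > T)    # False -> decide H_n
```

The babble-noise failure mode fits this picture: when the noise itself is speech-like, the two conditional densities overlap and no threshold separates them well.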
Digit “one” in close-talking mic, quiet office
Digit “one” in hands-free mic, car at 55 mph
Digit “one” in far-talking mic, cafeteria
5. Conclusion:
• The new voice-metric-based end-of-speech detector is robust over a wide
variety of background noise.
• The new detector adds only a small computational cost and can be
implemented in real time.
• The new detector improves recognition performance by discarding the extra
noise that a fixed data-capture window would otherwise include.
• The new detector needs further improvement in babble-noise
environments.
Session 2. Speech Enhancement Algorithms: Blind
Source Separation Methods (Spatial and Temporal Filtering)
1. Motivation and research goal.
2. Statement of “blind source separation” problem.
3. Principles of blind source separation.
4. Criteria for blind source separation.
5. Application to blind channel equalization for digital
communication systems.
6. Simulation and comparison.
7. Summary and conclusion.
1. Motivation:
• Mimic the human auditory system’s ability to separate the target signals from other sounds, such as interfering sources and background noise, for clear recognition of the target content.
• ‘One of the most striking facts about our ears is that we have two of them--and yet we hear one acoustic world; only one voice per speaker.’ (E. C. Cherry and W. K. Taylor. Some further experiments on the recognition of speech, with one and two ears. Journal of the Acoustic Society of America, 26:554-559, 1954)
• The ‘‘cocktail party effect’’--the ability to focus one’s listening attention on a single talker among a cacophony of conversations and background noise--has been recognized for some time. This specialized listening ability may be because of characteristics of the human speech production system, the auditory system, or high-level perceptual and language processing.
Research Goal:
Design a preprocessor built on digital signal processing speech-enhancement
algorithms. The input signals are collected through multiple sensor
(microphone) arrays; after the embedded signal processing algorithms run,
clearly separated signals appear at the output.
[Figure: audio input → blind source separation algorithms → enhanced output]
2. Problem Statement of Blind Source Separation:
What is “Blind Source Separation”?
[Figure: M source signals arriving at an array of N sensors; each sensor observes a linear mixture of the sources]

Given the N linearly mixed received input signals, we need to recover the M statistically independent sources as well as possible (N ≥ M).
Formulation of Blind Source Separation Problem:
A received signal vector from the array, X(t), is the original source vector S(t)
passed through the channel distortion H(t), such that X(t) = H(t) S(t), where

X(t) = [x_1(t), …, x_N(t)]^T,   S(t) = [s_1(t), …, s_M(t)]^T,

and H(t) = [h_ij(t)] is the N×M matrix of channel responses:

H(t) = | h_11(t) … h_1M(t) |
       |    ⋮          ⋮    |
       | h_N1(t) … h_NM(t) |

We need to estimate a separator W(t) = [w_pq(t)], an N×N matrix,

W(t) = | w_11(t) … w_1N(t) |
       |    ⋮          ⋮    |
       | w_N1(t) … w_NN(t) |

such that

S̃(t) = [s̃_1(t), …, s̃_M(t), 0, …, 0]^T = W(t) X(t).
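The formulation can be exercised on a toy instantaneous (memoryless) mixture. Here W is computed by inverting a known H only to verify the algebra; a blind algorithm must estimate W from X alone, using the independence criteria discussed next:

```python
import numpy as np

rng = np.random.default_rng(3)
# M = 2 statistically independent sources, N = 2 sensors
S = np.vstack([np.sign(rng.standard_normal(1000)),   # binary-valued source
               rng.uniform(-1.0, 1.0, 1000)])        # uniform source
H = np.array([[1.0, 0.6],
              [0.4, 1.0]])     # instantaneous mixing matrix
X = H @ S                       # received signals: X(t) = H S(t)

# Ideal separator: here we cheat and invert the known H; a blind algorithm
# must instead estimate W from X alone via an independence criterion.
W = np.linalg.inv(H)
S_hat = W @ X                   # recovered sources: S~(t) = W X(t)
print(np.allclose(S_hat, S))    # True
```

With convolutive channels H(t), the separator becomes a matrix of filters rather than a constant matrix, but the goal W(t) H(t) ≈ I is the same.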
3. Principles of Blind Source Separation:
The independence measure: Shannon’s mutual information.

I(y_1, …, y_N) = Σ_{i=1}^{N} H(y_i) − H(y_1, …, y_N) ≥ 0

Equivalently, in terms of the joint and marginal densities,

I(y_1, …, y_N) = E{ log[ f_Y(y_1, …, y_N) ] } − Σ_{i=1}^{N} E{ log[ f_{y_i}(y_i) ] },

which vanishes exactly when the outputs y_1, …, y_N are statistically independent.
4. Criteria to Separate Independent Sources:
• Constrained Entropy (Wu, IJCNN99):
  J_1 = −log[ |det(W)| ] − Σ_{i=1}^{N} log[ f_i(y_i, …) ]
• Hadamard Measure (Wu, ICA99):
  J_2 = Σ_i log( [E[Y Y^T]]_ii ) − log det( E[Y Y^T] )
• Frobenius Norm (Wu, NNSP97):
  J_3 = ‖ E[Y Y^T] − diag( E[Y Y^T] ) ‖_F²
• Quadratic Gaussianity (Wu, NNSP99):
  J_4 = Σ_i ∫ [ f_{Y_i}(y_i) − f_G(y_i) ]² dy_i
5. Application to Blind Single Channel Equalization for Digital Communication Systems:

We apply the minimization of the modified constrained entropy

J_1 = −log( w_0 ) − Σ_{i=1}^{N} log[ f_i(y_i, …) ]

to adapt an equalizer w(t) = [w_0, w_1, …] for a digital channel h(t). Assume a
PAM signal constellation with symbols s(t), passing through the digital channel

h(t) = [ c(t, 0.11) + 0.8 c(t−1, 0.11) − 0.4 c(t−3, 0.11) ] W_{6T}(t),

where

c(t, α) = sinc(t/T) · cos(απt/T) / (1 − 4α²t²/T²)

is the raised-cosine function with roll-off factor α, and W_{6T}(t) = rect(t/(6T))
is a rectangular window. The input signal to the equalizer is

x(t) = h(t) * s(t) + n(t),

where n(t) is the background noise. We applied generalized anti-Hebbian
learning to adapt w(t) such that w(t) * h(t) ≈ δ(t).
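The goal condition w(t) * h(t) ≈ δ(t) can be checked numerically. The sketch below uses sample-spaced taps suggested by the slide's h(t) (pulse shaping and windowing omitted) and computes the inverse filter directly in the frequency domain — a non-blind shortcut, whereas the slides adapt w(t) blindly via generalized anti-Hebbian learning without knowing h:

```python
import numpy as np

# Sample-spaced channel taps suggested by the slide's h(t) (the raised-cosine
# pulse shaping and the window are omitted in this simplification)
h = np.array([1.0, 0.8, 0.0, -0.4])

# Non-blind shortcut: invert the channel in the frequency domain and verify
# the equalization goal w(t) * h(t) ~ delta(t).
n = 64
w = np.fft.irfft(1.0 / np.fft.rfft(h, n), n)     # truncated channel inverse
combined = np.convolve(w, h)[:n]                 # w(t) * h(t)
print(round(float(combined[0]), 3))              # ~1.0 (the delta spike)
print(round(float(np.max(np.abs(combined[1:]))), 3))   # ~0.0 (residual ISI)
```

The direct inverse is well behaved here because this channel is minimum-phase; a blind adaptive rule must reach the same combined response using only the equalizer output statistics.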
[Figure: signal-to-interference ratio (dB) vs. signal-to-noise ratio (dB)]
[Figure: bit error rate vs. signal-to-noise ratio (dB)]
6. Simulation and Comparison:
We compare our generalized anti-Hebbian learning against the SDIF
algorithm and Lee’s Infomax method (Lee, IJCNN97) over three real
recordings downloaded from the Salk Institute, University of California
at San Diego.
New VR LITE Front End: Blind Source Separation + End-of-speech Detection

| Scheme | Average Detection Time Error | Average False Detection Time Error | Average Correct Detection Time Error | Number of Strings | False Detection Rate | Total Detection Rate |
|---|---|---|---|---|---|---|
| EOS only | 0.256 seconds | 0.155 seconds | 0.317 seconds | 14 | 7.14% | 42.86% |
| BSS + EOS | 0.236 seconds | 0.125 seconds | 0.322 seconds | 14 | 7.14% | 50.00% |
7. Conclusion and Future Research:
• The computational complexity of blind source separation needs to
be reduced.
• Test BSS for EOS detection with microphone arrays of the same
kind.
• Incorporate other array signal processing techniques (e.g., a beamformer)
to improve speech detection and recognition.