Demo End of Speech




     PCS Research & Advanced Technology Labs

    Speech Lab

    How to deal with the noise in real systems?

    Hsiao-Chun Wu

    Motorola PCS Research and Advanced

    Technology Labs, Speech Laboratory

    [email protected]

    Phone: (815) 884-3071

November 14, 2000

    Why do we need to study noise?

Noise exists everywhere, and it degrades the performance of signal processing in real systems. Since noise cannot be avoided by system engineers, modern “noise-processing” technology has been researched and designed to overcome this problem. Hence many related research areas have emerged, such as signal detection, signal enhancement/noise suppression, and channel equalization.


How to deal with noise? Cut it off!

• Spectral Truncation
  – Spectral Subtraction
• Time Truncation
  – Signal Detection
• Spatial and/or Temporal Filtering
  – Equalization
  – Array Signal Separation (Blind Source Separation)

Spectral truncation (spectral subtraction):

  R(f) = S̃(f) = S(f) + N(f),   Ŝ(f) = S̃(f) − Ñ(f) ≈ S(f)

Time truncation (signal detection):

  r(τ) ≈ n(τ),  τ ∈ T_noise

Spatial/temporal filtering:

  r(t) = s̃(t) = w(t) ∗ h(t) ∗ s(t) ≈ s(t),   R(f) = S̃(f) = W(f) H(f) S(f) ≈ S(f)
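As a concrete illustration of the spectral-truncation idea, here is a minimal magnitude-domain spectral subtraction sketch in Python, assuming the noise spectrum is estimated by averaging noise-only frames; the function name and frame handling are illustrative, not part of the original system:

```python
import numpy as np

# Hedged sketch of magnitude spectral subtraction: estimate the noise
# magnitude spectrum from noise-only frames, subtract it from the noisy
# spectrum, floor at zero, and reuse the noisy phase.
def spectral_subtraction(noisy_frame, noise_frames):
    N = np.fft.rfft(np.asarray(noise_frames), axis=-1)   # spectra of noise-only frames
    noise_mag = np.abs(N).mean(axis=0)                   # average noise magnitude
    X = np.fft.rfft(noisy_frame)                         # noisy speech spectrum
    clean_mag = np.maximum(np.abs(X) - noise_mag, 0.0)   # subtract, half-wave rectify
    # Resynthesize with the noisy phase (standard in spectral subtraction)
    return np.fft.irfft(clean_mag * np.exp(1j * np.angle(X)), n=len(noisy_frame))
```

With a noise-only input and a matching noise estimate, the output is driven to zero, which is exactly the truncation the slide describes.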


Session 1. On-line Automatic End-of-speech Detection Algorithm (Time Truncation)

1. Project goal.
2. Review of current methods.
3. Introduction to the voice-metric-based end-of-speech detector.


    1. Project Goal:

    • Problem

    – Digit-dial recognition with unknown digit string length

    • Solution 1

– fixed-length window such as 10 seconds? (inconvenient for users)

    • Solution 2

    – Dynamic termination of data capture? (need a robust detection

    algorithm)


• Research and design a robust dynamic termination mechanism for speech recognizers.

    – a new on-line automatic end-of-speech detection algorithm with small

    computational complexity.

    • Design a more robust front end to improve the recognition accuracy for

    speech recognizers.

– the new algorithm also reduces excessive feature extraction from redundant noise.


2. Review of Current Methods:

Most speech detection algorithms fall into three categories.

• Frame energy detection
  – short-term frame energy (20 msec) can be used for speech/noise classification.
  – it is not robust at high background noise levels.
• Zero-crossing rate detection
  – short-term zero-crossing rate can also be used for speech/noise classification.
  – it is not robust across a wide variety of noise types.
• Higher-order-spectral detection
  – short-term higher-order spectra can be used for speech/noise classification.
  – it implies a heavy computational complexity, and its threshold is difficult to pre-determine.
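The first two feature types above are cheap enough to sketch directly; this is an illustrative Python version (frame length and decision thresholds omitted):

```python
import numpy as np

# Illustrative versions of the two classic short-term features from the
# review: frame energy and zero-crossing rate over one frame (e.g. 20 ms).
def frame_energy(frame):
    frame = np.asarray(frame, dtype=float)
    return float(np.sum(frame ** 2) / len(frame))   # mean-square energy

def zero_crossing_rate(frame):
    frame = np.asarray(frame, dtype=float)
    signs = np.sign(frame)
    signs[signs == 0] = 1                           # treat exact zeros as positive
    return float(np.mean(signs[1:] != signs[:-1]))  # fraction of sign changes
```

Their low cost is why these features were the historical first choice, despite the robustness problems listed above.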


3. Introduction to Voice Metric Based End-of-speech Detector:

• End-of-speech detection using voice metric features is based on the Mel-energies. Voice metric features are robust over a wide variety of background noise. Originally, the voice-metric-based speech/noise classifier was applied in the IS-127 CELP speech coder standard. We modify and enhance the voice-metric features to design a new end-of-speech detector for the Motorola voice recognition front end (VR LITE III).
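A minimal sketch of the stopping rule such a detector could use, assuming a per-frame score compared against a threshold and a silence-duration limit (the 1.85-second silence threshold appears in the simulation section; the frame period, names, and logic here are hypothetical, not the VR LITE III implementation):

```python
# Hedged sketch of an end-of-speech stopping rule: once speech has started,
# data capture stops after the per-frame score stays below a threshold for a
# given silence duration. The score stands in for the voice metric.
def end_of_speech(scores, threshold, frame_period=0.02, silence_limit=1.85):
    silence = 0.0
    started = False
    for i, s in enumerate(scores):
        if s >= threshold:
            started, silence = True, 0.0   # speech frame: reset silence counter
        elif started:
            silence += frame_period        # noise frame after speech started
            if silence >= silence_limit:
                return i                   # frame index where capture stops
    return None                            # no end of speech detected
```

The counter-reset on every speech frame is what makes the rule tolerate pauses between digits shorter than the silence limit.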


[Figure: voice metric score table]



[Block diagram of the end-of-speech detector added to the original VR LITE front end: raw data, FFT, Mel spectrum, voice metric (voice metric scores), SNR estimate, pre- and post-S/N classifiers, EOS buffer, threshold adaptation, speech-start decision, and silence duration threshold; when the end-of-speech detector outputs “yes,” data capture stops.]


[Flow chart: speech input → segmentation of speech into frames → front end with end-of-speech detector → feature vector → frame buffer → VR LITE recognition engine; for each frame i, if end of speech is detected (“yes”) data capture terminates, otherwise (“no”) processing continues with the next frame i+1.]


[Figure: example captured waveforms, 6.51 seconds and 3.7 seconds.]


[Figure: waveform of the string “2-2-9-1-7-8” in a car at 55 mph (time axis in seconds), marking the end point, a correct detection with its correct-detection time error, and a false detection with its false-detection time error.]


4. Simulation Results: (Simulation is done over the Motorola digit-string database, including 16 speakers and 15,166 variable-length digit strings in 7 different conditions. The silence threshold is 1.85 seconds.)

A. Receiver Operating Curve (ROC): The ROC curve is the relationship between the end-of-speech detection rate and the false (early) detection rate. We compare two different methods, namely, (1) the new voice-metric based end-of-speech detector and (2) the old speech/noise flag based end-of-speech detector.
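The two rates on the ROC axes can be tallied per utterance as follows (a hedged sketch; the outcome labels are hypothetical bookkeeping, not the deck's scoring code):

```python
# One ROC point from labeled per-utterance outcomes: the detection rate counts
# every utterance where the detector fired at all, while the false (early)
# detection rate counts only firings before the true end point.
def roc_point(outcomes):
    # outcomes: list of "correct", "false", or "miss" per utterance (hypothetical labels)
    n = len(outcomes)
    detect_rate = sum(o in ("correct", "false") for o in outcomes) / n
    false_rate = sum(o == "false" for o in outcomes) / n
    return detect_rate, false_rate
```

Sweeping the detector's threshold and re-tallying gives the full curve.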


[Figure: ROC curve, detection rate (%) versus false detection rate (%).]


• B. String-accuracy-convergence (SAC) curve: The SAC curve is the relationship between the string recognition accuracy and the false (early) detection rate. We compare two different methods, namely, (1) the new voice-metric based end-of-speech detector and (2) the old speech/noise flag based end-of-speech detector.


[Figure: SAC curve, string recognition accuracy (%) versus false detection rate (%).]


C. Table of detection results: (This table illustrates the result among the Madison sub-database, including data files with 1.85 seconds or more of silence after end of speech.)

Condition | Average Time Error | Average False-Detection Time Error | Average Correct-Detection Time Error | False Detection Rate | String Numbers | Total Detection Rate
Overall | 1.98 sec | 1.68 sec | 1.85 sec | 0.47% | 7,418 | 86.08%
Office Close-talk | 1.97 sec | 0 sec | 1.93 sec | 0% | 907 | 9…


(This table illustrates the result over the small database collected by Motorola PCS CSSRL. All digit strings are recorded in a 15-second fixed window.)

Condition | Average Time Error | Average False-Detection Time Error | Average Correct-Detection Time Error | False Detection Rate | String Numbers | Total Detection Rate | String Recognition Accuracy (w/ EOS) | String Recognition Accuracy (w/o EOS)
Overall | 1.82 seconds | 0 seconds | 1.82 seconds | 0% | 121 | 96.69% | 50.41% | 29.75%
Office Close-talk | 1.5 seconds | 0 seconds | 1.5 seconds | 0% | 21 | 100% | 66.67% | 61.90%
Office Arm-length | 1.? seconds | 0 seconds | 1.? seconds | 0% | 20 | 100% | 65.00% | 65.00%
Café Close-talk | 1.76 seconds | 0 seconds | 1.76 seconds | 0% | … | … | … | …


Analysis of the Simulation Result: Why didn't EOS detection work well in babble noise?


Optimal Detection Decision

• Bayes classifier
• Likelihood Ratio Test

  L(x) = log f_s(x|H_s) − log f_n(x|H_n)

  Decide H_s (speech) if L(x) ≥ T_Bayes; otherwise decide H_n (noise).
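As a toy numeric instance of the likelihood ratio test, take both conditional densities to be zero-mean Gaussians with different variances (an assumption for illustration; the slide does not fix the densities):

```python
import math

# Log-likelihood ratio test with Gaussian hypotheses: speech frames are modeled
# as higher-variance than noise frames (sigma values are illustrative).
def log_gauss(x, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - x ** 2 / (2 * sigma ** 2)

def lrt_decide(x, sigma_s=3.0, sigma_n=1.0, T=0.0):
    L = log_gauss(x, sigma_s) - log_gauss(x, sigma_n)   # log f_s(x) - log f_n(x)
    return "speech" if L >= T else "noise"
```

Large-amplitude samples favor the high-variance (speech) hypothesis; samples near zero favor noise.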


Digit “one” in close-talking mic, quiet office


Digit “one” in hands-free mic, 55 mi/h car


Digit “one” in far-talking mic, cafeteria


5. Conclusion:

• The new voice-metric based end-of-speech detector is robust over a wide variety of background noise.
• The new voice-metric based end-of-speech detector brings only a small increase in computational complexity and can be implemented in real time.
• The new voice-metric based end-of-speech detector can improve recognition performance by discarding the extra noise due to the fixed data capture window.
• The new voice-metric based end-of-speech detector needs further improvement in the babble noise environment.


Session 2. Speech Enhancement Algorithms: Blind Source Separation Methods (Spatial and Temporal Filtering)

1. Motivation and research goal.
2. Statement of the “blind source separation” problem.
3. Principles of blind source separation.
4. Criteria for blind source separation.
5. Application to blind channel equalization for digital communication systems.
6. Simulation and comparison.
7. Summary and conclusion.


1. Motivation:

• Mimic the human auditory system to differentiate the subject signals from other sounds, such as interfering sources and background noise, for clear recognition of the subject contents.
• ‘One of the most striking facts about our ears is that we have two of them--and yet we hear one acoustic world; only one voice per speaker.’ (E. C. Cherry and W. K. Taylor. Some further experiments on the recognition of speech, with one and two ears. Journal of the Acoustic Society of America, 26:554-559, 1954)
• The ‘‘cocktail party effect’’--the ability to focus one's listening attention on a single talker among a cacophony of conversations and background noise--has been recognized for some time. This specialized listening ability may be because of characteristics of the human speech production system, the auditory system, or high-level perceptual and language processing.


Research Goal:

Design a preprocessor with digital signal processing speech enhancement algorithms. The input signals are collected through multiple sensor (microphone) arrays. After the embedded signal processing algorithms run, we obtain clearly separated signals at the output.


[Diagram: Audio Input → Blind Source Separation Algorithms → Enhanced Output]


2. Problem Statement of Blind Source Separation:

What is “Blind Source Separation”?

[Diagram: Signal 1 … Signal M arriving at Sensor 1 … Sensor N as the received input signals.]

Given the N linearly mixed received input signals, we need to recover the M statistically independent sources as much as possible (N ≥ M).


Formulation of Blind Source Separation Problem:

A received signal vector from the array, X(t), is the original source vector S(t) passed through the channel distortion H(t), such that X(t) = H(t) ⊗ S(t), where

  X(t) = [x_1(t), …, x_N(t)]^T,   S(t) = [s_1(t), …, s_M(t)]^T

and H(t) is the N × M matrix of channel responses h_ij(t), 1 ≤ i ≤ N, 1 ≤ j ≤ M.

We need to estimate a separator W(t), an N × N matrix of filters w_pq(t), such that

  Ŝ(t) = [ŝ_1(t), …, ŝ_M(t), 0, …, 0]^T = W(t) ⊗ X(t),  with ŝ_i(t) ≈ s_i(t).
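The memoryless special case of this model is easy to sketch: with an instantaneous 2 × 2 mixture and H known, W = H⁻¹ separates exactly; blind methods must find such a W without knowing H (the mixing matrix below is hypothetical):

```python
import numpy as np

# Instantaneous (no-convolution) special case of X = H ⊗ S with N = M = 2.
rng = np.random.default_rng(0)
S = rng.uniform(-1, 1, size=(2, 1000))   # two independent sources
H = np.array([[1.0, 0.6],
              [0.4, 1.0]])               # hypothetical mixing matrix
X = H @ S                                # sensor (received) signals
W = np.linalg.inv(H)                     # ideal separator for this toy case
S_hat = W @ X                            # recovered sources
```

The whole difficulty of BSS is reaching this W from X alone, using only the statistical independence of the sources.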


3. Principles of Blind Source Separation:

The independence measurement: Shannon's Mutual information.

  I(y_1, …, y_N) = Σ_{i=1}^{N} H(y_i) − H(y_1, …, y_N) ≥ 0

  I(y_1, …, y_N) = E[log f(y_1, …, y_N)] − Σ_{i=1}^{N} E[log f_i(y_i)]
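The identity I = Σ H(y_i) − H(y_1, …, y_N) ≥ 0, with equality exactly at independence, can be checked numerically on a toy discrete example (fair bits, fully dependent versus independent):

```python
import math

# Discrete Shannon entropy in bits.
def entropy(p):
    return -sum(q * math.log2(q) for q in p if q > 0)

# Two identical fair bits (fully dependent): joint outcomes (0,0), (1,1), each 0.5
H_joint_dep = entropy([0.5, 0.5])                                     # 1 bit
I_dep = entropy([0.5, 0.5]) + entropy([0.5, 0.5]) - H_joint_dep       # 1 bit

# Two independent fair bits: four joint outcomes, each 0.25
H_joint_ind = entropy([0.25] * 4)                                     # 2 bits
I_ind = entropy([0.5, 0.5]) + entropy([0.5, 0.5]) - H_joint_ind       # 0 bits
```

Driving this mutual information of the separator outputs to zero is what the criteria on the next slide aim at.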


4. Criteria to Separate Independent Sources:

• Constrained Entropy (Wu, IJCNN99):
  – J = −log|det(W)| − Σ_{i=1}^{N} log f_i(y_i)
• Hadamard Measure (Wu, ICA99):
  – J = log det[diag(Y Y^T)] − log det(Y Y^T)
• Frobenius Norm (Wu, NNSP97):
  – J = ‖Y Y^T − diag(Y Y^T)‖_F^2
• Quadratic Gaussianity (Wu, NNSP99):
  – J = Σ_{i=1}^{N} ∫ (f_i(y_i) − f_G(y_i))^2 dy_i


5. Application to Blind Single Channel Equalization for Digital Communication Systems:

We apply the minimization of the modified constrained entropy to adapt an equalizer w(t) = [w_0, w_1, …] for a digital channel h(t). Assume a PAM signal constellation with symbols s(t), passing through a digital channel

  h(t) = [c(t, 0.11) + 0.8 c(t−1, 0.11) − 0.4 c(t−3, 0.11)] W_6T(t),

where c(t, β) is the raised-cosine function with roll-off factor β and W_6T(t) = rect(t/6T) is a rectangular window. The input signal to the equalizer is

  x(t) = h(t) ∗ s(t) + n(t),

where n(t) is the background noise. We applied generalized anti-Hebbian learning to adapt w(t) such that

  w(t) ∗ h(t) ≈ δ(t − τ).
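The goal w(t) ∗ h(t) ≈ δ(t − τ) can be illustrated on a toy FIR channel by fitting a least-squares (zero-forcing style) equalizer; the channel taps below are illustrative minimum-phase values, not the slide's raised-cosine channel, and the least-squares fit stands in for the anti-Hebbian adaptation:

```python
import numpy as np

h = np.array([1.0, 0.8, 0.15])     # illustrative minimum-phase channel taps
L, delay = 24, 4                   # equalizer length and target delay tau
# Convolution matrix: column i holds h shifted by i, so C @ w == h * w
C = np.zeros((L + len(h) - 1, L))
for i in range(L):
    C[i:i + len(h), i] = h
d = np.zeros(L + len(h) - 1)
d[delay] = 1.0                     # desired combined response: delta at the delay
w, *_ = np.linalg.lstsq(C, d, rcond=None)
combined = C @ w                   # w * h, approximately a delayed delta
```

The combined response has a near-unit tap at the delay and near-zero residual intersymbol interference elsewhere, which is exactly the δ(t − τ) condition.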


[Figure: signal-to-interference ratio (dB) versus signal-to-noise ratio (dB).]


[Figure: bit error rate versus signal-to-noise ratio (dB).]


6. Simulation and Comparison:

We compare simulation results among our generalized anti-Hebbian learning, the SDIF algorithm, and Lee's Infomax method (Lee, IJCNN97) over three real recordings downloaded from the Salk Institute, University of California at San Diego.


New VR LITE Front-end: Blind Source Separation + End-of-speech Detection

Schemes | Average Detection Time Error | Average False-Detection Time Error | Average Correct-Detection Time Error | Number of Strings | False Detection Rate | Total Detection Rate
EOS only | 0.256 seconds | 0.155 seconds | 0.317 seconds | 1? | 7.1 | …


7. Conclusion and Future Research:

• The computational complexity of blind source separation needs to be reduced.
• Test BSS for EOS detection under microphone arrays of the same kind.
• Incorporate other array signal processing (beamformer?) techniques to improve speech detection and recognition.