PCS Research & Advanced Technology Labs
Speech Lab
How to deal with noise in real systems?
Hsiao-Chun Wu
Motorola PCS Research and Advanced Technology Labs, Speech Laboratory
Phone: (815) 884-3071
PCS Research & Advanced Technology Labs
Speech Lab November 14, 2000
Why do we need to study noise?
Noise exists everywhere, and it degrades the performance of real-world signal
processing. Since noise cannot be avoided, system engineers have researched
and designed modern “noise-processing” technology to overcome this problem.
Many related research areas have emerged as a result, such as signal detection,
signal enhancement/noise suppression, and channel equalization.
How to deal with noise? Cut it off!!!!

• Spectral Truncation – Spectral Subtraction (1989):
  Given the received spectrum R(f) = S(f) + N(f), estimate the noise spectrum Ñ(f) and form
  S̃(f) = R(f) − Ñ(f) ≈ S(f).
• Time Truncation – Signal Detection:
  Keep s̃(t) = r(t) only for t inside the detected speech interval T; outside T the
  received signal r(t) is noise alone and is discarded.
• Spatial and/or Temporal Filtering – Equalization:
  s̃(t) = w(t) * r(t) = w(t) * h(t) * s(t) ≈ s(t), where r(t) = h(t) * s(t).
  – Array Signal Separation (Blind Source Separation):
  S̃(t) = W(t) R(t) = W(t) H(t) S(t) ≈ S(t), where R(t) = H(t) S(t).
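As an illustration of the spectral-truncation idea, here is a minimal Python sketch of magnitude spectral subtraction. The half-wave rectification of negative magnitudes and the oracle noise estimate are simplifying assumptions for the demo, not details from the slides:

```python
import numpy as np

def spectral_subtraction(r, noise_mag, n_fft=256):
    """S~(f) = R(f) - N~(f): subtract an estimated noise magnitude spectrum
    from the received spectrum, keep the noisy phase, and resynthesize."""
    R = np.fft.rfft(r, n_fft)
    mag = np.maximum(np.abs(R) - noise_mag, 0.0)   # half-wave rectify negatives
    return np.fft.irfft(mag * np.exp(1j * np.angle(R)), n_fft)

# toy example: a sinusoid buried in white noise
rng = np.random.default_rng(0)
t = np.arange(256)
clean = np.sin(2.0 * np.pi * 0.05 * t)
noise = 0.5 * rng.standard_normal(256)
noise_mag = np.abs(np.fft.rfft(noise, 256))        # oracle noise spectrum
enhanced = spectral_subtraction(clean + noise, noise_mag)
```

In practice the noise spectrum must itself be estimated, typically from frames classified as noise-only.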
Session 1. On-line Automatic End-of-speech Detection Algorithm (Time Truncation)
1. Project goal.
2. Review of current methods.
3. Introduction to voice metric based end-of-speech detector.
4. Simulation results.
5. Conclusion.
1. Project Goal:
• Problem
– Digit-dial recognition with unknown digit string length
• Solution 1
– A fixed-length capture window, e.g., 10 seconds? (inconvenient for users)
• Solution 2
– Dynamic termination of data capture? (requires a robust detection
algorithm)
• Research and design a robust dynamic-termination mechanism for the speech
recognizer:
– a new on-line automatic end-of-speech detection algorithm with low
computational complexity.
• Design a more robust front end to improve the recognition accuracy of
speech recognizers:
– the new algorithm also avoids excessive feature extraction on redundant
noise.
2. Review of Current Methods:
Most speech detection algorithms fall into three categories.
• Frame energy detection – short-term frame energy (20 msec) can be used for speech/noise
classification.
– It is not robust at high background-noise levels.
• Zero-crossing-rate detection – the short-term zero-crossing rate can also be used for speech/noise
classification.
– It is not robust across a wide variety of noise types.
• Higher-order-spectral detection – short-term higher-order spectra can be used for speech/noise
classification.
– It incurs heavy computational complexity, and its threshold is difficult to predetermine.
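For concreteness, the first two detectors can be sketched in a few lines of Python (the 20 ms frame length and 8 kHz sampling rate are illustrative choices):

```python
import numpy as np

def frame_energy(frame):
    """Short-term energy of one analysis frame."""
    return float(np.sum(np.asarray(frame, dtype=float) ** 2))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    return float(np.mean(signs[1:] != signs[:-1]))

# one 20 ms frame at 8 kHz = 160 samples
t = np.arange(160) / 8000.0
voiced = np.sin(2.0 * np.pi * 120.0 * t)       # low-frequency tone: high energy
rng = np.random.default_rng(1)
noise = 0.05 * rng.standard_normal(160)        # weak wideband noise
print(frame_energy(voiced) > frame_energy(noise))              # True
print(zero_crossing_rate(noise) > zero_crossing_rate(voiced))  # True
```

Both statistics separate this easy example cleanly; the weaknesses listed above appear when the noise is loud (energy) or speech-like (zero-crossing rate).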
3. Introduction to the Voice-Metric-Based End-of-speech Detector:
• End-of-speech detection using voice-metric features is based on the Mel-
energies. Voice-metric features are robust over a wide variety of background
noise. The voice-metric-based speech/noise classifier was originally used in the
IS-127 CELP speech coder standard. We modify and enhance the voice-metric
features to design a new end-of-speech detector for the Motorola voice
recognition front end (VR LITE III).
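The actual IS-127 score table is not reproduced in these slides; the sketch below only illustrates the general mechanism — per-Mel-band SNR estimates looked up in a score table and summed — with made-up thresholds and scores:

```python
import numpy as np

# Illustrative (made-up) score table: higher band SNR -> larger score
SNR_EDGES = np.array([0.0, 3.0, 6.0, 9.0, 12.0, 15.0])   # dB bin edges
SCORES = np.array([0, 1, 2, 4, 8, 16, 32])               # one score per bin

def voice_metric(mel_energies, noise_energies):
    """Sum the per-Mel-band scores looked up from each band's SNR estimate."""
    snr_db = 10.0 * np.log10(np.maximum(mel_energies, 1e-12) /
                             np.maximum(noise_energies, 1e-12))
    return int(np.sum(SCORES[np.searchsorted(SNR_EDGES, snr_db)]))

noise_bands = np.ones(16)            # flat noise-floor Mel energies
speech_bands = 20.0 * np.ones(16)    # ~13 dB above the noise floor
print(voice_metric(speech_bands, noise_bands))   # 256: strongly speech-like
print(voice_metric(noise_bands, noise_bands))    # 0: noise-like
```

A speech/noise decision then compares the summed score against an (adaptive) threshold.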
[Figure: voice metric score table]
[Figure: block diagram of the end-of-speech detector attached to the original VR LITE front end — raw data → FFT → Mel-spectrum → voice metric scores and SNR estimate → pre- and post-S/N classifiers with threshold adaptation → EOS buffer; once speech has started, data capture stops when the silence duration exceeds the threshold]
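The termination logic at the bottom of the diagram — stop capture once enough consecutive silence frames have followed the start of speech — can be sketched as below. The per-frame speech/noise flags and the frame-count threshold are illustrative inputs; in the real detector they come from the voice-metric classifier and the 1.85-second silence threshold:

```python
def end_of_speech(frame_is_speech, silence_threshold_frames):
    """Return the frame index at which data capture stops: after speech has
    started, terminate once the run of silence frames reaches the threshold."""
    speech_started = False
    silence_run = 0
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            speech_started = True
            silence_run = 0
        elif speech_started:
            silence_run += 1
            if silence_run >= silence_threshold_frames:
                return i
    return len(frame_is_speech)      # window exhausted without termination

# speech with a short pause, then trailing silence; threshold = 3 frames
flags = [False, True, True, False, True, False, False, False, False]
print(end_of_speech(flags, 3))       # 7
```

Note that the short mid-utterance pause resets the silence counter, so only the trailing silence triggers termination.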
[Figure: integration flowchart — speech input → front end with end-of-speech detector → segmentation of speech into frames → feature vector frame buffer → VR LITE recognition engine; after each frame i the detector asks “end of speech?” — yes: data capture terminates; no: proceed to frame i+1]
[Figure: raw data, end point, and detected end point for the string “2-2-9-1-7-8” in a car at 55 mph; marked times: 6.51 seconds, 3.78 seconds, 4.81 seconds]
[Figure: correct-detection and false-detection time errors relative to the end point for the string “2-2-9-1-7-8” in a car at 55 mph; horizontal axis in seconds]
4. Simulation Results: (Simulations were run over the Motorola digit-string database: 16 speakers and 15,166 variable-length digit strings under 7 different conditions. The silence threshold is 1.85 seconds.)
A. Receiver Operating Characteristic (ROC) curve: the ROC curve
plots the end-of-speech detection rate against the false (early)
detection rate. We compare two methods:
(1) the new voice-metric-based end-of-speech detector and
(2) the old speech/noise-flag-based end-of-speech detector.
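Under one plausible reading of these definitions — each utterance ends in a correct detection, a false (early) detection, or a miss — a single ROC operating point could be computed as follows (the outcome labels are an assumption, not from the slides):

```python
def roc_point(outcomes):
    """Detection rate and false (early) detection rate, both in percent.
    outcomes: 'correct', 'false', or 'missed', one label per utterance."""
    n = len(outcomes)
    detection_rate = 100.0 * sum(o != 'missed' for o in outcomes) / n
    false_rate = 100.0 * sum(o == 'false' for o in outcomes) / n
    return detection_rate, false_rate

print(roc_point(['correct', 'correct', 'false', 'missed']))   # (75.0, 25.0)
```

Sweeping the detector's silence threshold trades one rate against the other and traces out the curve.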
[Figure: ROC curves — detection rate (%) vs. false detection rate (%) for the two detectors]
• B. String-accuracy-convergence (SAC) curve: SAC
curve is the relationship between the string recognition accuracy
versus the false (early) detection rate. We compare two different
methods, namely, (1) new voice-metric based end-of-speech
detector and (2) old speech/noise flag based end-of-speech
detector.
[Figure: SAC curves — string recognition accuracy (%) vs. false detection rate (%) for the two detectors]
C. Table of detection results: (This table shows results over the Madison sub-database, which includes data files with 1.85 seconds or more of silence after the end of speech.)
| Condition | Average Time Error | Average False Detection Time Error | Average Correct Detection Time Error | False Detection Rate | Number of Strings | Total Detection Rate |
|---|---|---|---|---|---|---|
| Overall | 1.98 sec | 1.68 sec | 1.85 sec | 0.47% | 7,418 | 86.08% |
| Office Close-talk | 1.97 sec | 0 sec | 1.93 sec | 0% | 907 | 94.82% |
| Office Arm-length | 1.98 sec | 0 sec | 1.93 sec | 0% | 988 | 93.62% |
| Café Close-talk | 2.17 sec | 0 sec | 2.00 sec | 0% | 1,147 | 81.87% |
| Café Arm-length | 2.31 sec | 0.14 sec | 2.00 sec | 0.11% | 898 | 57.57% |
| Car Idle (HF) | 1.91 sec | 1.02 sec | 1.84 sec | 0.08% | 1,210 | 93.97% |
| Car 35 mph (HF) | 1.93 sec | 0.96 sec | 1.77 sec | 0.71% | 1,130 | 87.61% |
| Car 55 mph (HF) | 1.66 sec | 2.00 sec | 1.59 sec | 2.20% | 1,138 | 89.63% |
(This table shows results over the small database collected by Motorola PCS CSSRL. All digit strings were recorded in a fixed 15-second window.)

| Condition | Average Time Error | Average False Detection Time Error | Average Correct Detection Time Error | False Detection Rate | Number of Strings | Total Detection Rate | String Recognition Accuracy (w/ EOS) | String Recognition Accuracy (w/o EOS) |
|---|---|---|---|---|---|---|---|---|
| Overall | 1.82 sec | 0 sec | 1.82 sec | 0% | 121 | 96.69% | 50.41% | 29.75% |
| Office Close-talk | 1.85 sec | 0 sec | 1.85 sec | 0% | 21 | 100% | 66.67% | 61.90% |
| Office Arm-length | 1.84 sec | 0 sec | 1.84 sec | 0% | 20 | 100% | 65.00% | 65.00% |
| Café Close-talk | 1.76 sec | 0 sec | 1.76 sec | 0% | 40 | 100% | 40.00% | 15.00% |
| Café Arm-length | 1.85 sec | 0 sec | 1.85 sec | 0% | 40 | 90% | 45.00% | 10.00% |
Analysis of the Simulation Results: why didn’t EOS detection work well in babble noise?
Optimal Detection Decision
• Bayes classifier
• Likelihood Ratio Test
Decide between the hypotheses by comparing the log-likelihoods:

log[ f(x | s+n) ] ≷ log[ f(x | n) ]   (H_s if the left side is larger, H_n otherwise)

Equivalently, in terms of the log-likelihood ratio,

L(x) = log[ f(x | s+n) ] − log[ f(x | n) ] ≷ T,   with T = T_Bayes = log[ P(n) / P(s+n) ].
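A minimal numerical example of the likelihood-ratio test, assuming zero-mean Gaussian frame models in which speech simply adds power to the noise — a simplification standing in for whatever densities a real detector would fit:

```python
import numpy as np

def log_likelihood_ratio(x, sigma_n, sigma_sn):
    """L(x) = log f(x | s+n) - log f(x | n) for zero-mean Gaussian frame
    models that differ only in variance (speech adds power to the noise)."""
    def loglik(x, var):
        return -0.5 * np.sum(np.log(2.0 * np.pi * var) + x ** 2 / var)
    return loglik(x, sigma_sn ** 2) - loglik(x, sigma_n ** 2)

rng = np.random.default_rng(2)
noise_frame = 1.0 * rng.standard_normal(160)    # noise-only frame
speech_frame = 3.0 * rng.standard_normal(160)   # speech+noise frame (more power)
T = 0.0   # threshold; the Bayes choice would be log[P(n)/P(s+n)]
print(log_likelihood_ratio(speech_frame, 1.0, 3.0) > T)   # True  -> decide H_s
print(log_likelihood_ratio(noise_frame, 1.0, 3.0) > T)    # False -> decide H_n
```

The babble-noise failure mode fits this picture: when the noise itself is speech-like, the two conditional densities overlap and no threshold separates them well.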
Digit “one” in close-talking mic, quiet office
Digit “one” in hands-free mic, car at 55 mph
Digit “one” in far-talking mic, cafeteria
5. Conclusion:
• The new voice-metric-based end-of-speech detector is robust over a wide
variety of background noise.
• The new detector adds only a small computational cost and can be
implemented in real time.
• The new detector improves recognition performance by discarding the extra
noise that a fixed data-capture window would otherwise include.
• The new detector needs further improvement in babble-noise
environments.
Session 2. Speech Enhancement Algorithms: Blind
Source Separation Methods (Spatial and Temporal Filtering)
1. Motivation and research goal.
2. Statement of “blind source separation” problem.
3. Principles of blind source separation.
4. Criteria for blind source separation.
5. Application to blind channel equalization for digital
communication systems.
6. Simulation and comparison.
7. Summary and conclusion.
1. Motivation:
• Mimic the human auditory system’s ability to separate the target signals from other sounds, such as interfering sources and background noise, for clear recognition of the target content.
• ‘One of the most striking facts about our ears is that we have two of them--and yet we hear one acoustic world; only one voice per speaker.’ (E. C. Cherry and W. K. Taylor. Some further experiments on the recognition of speech, with one and two ears. Journal of the Acoustic Society of America, 26:554-559, 1954)
• The ‘‘cocktail party effect’’--the ability to focus one’s listening attention on a single talker among a cacophony of conversations and background noise--has been recognized for some time. This specialized listening ability may be because of characteristics of the human speech production system, the auditory system, or high-level perceptual and language processing.
Research Goal:
Design a preprocessor built on digital signal processing speech-enhancement
algorithms. The input signals are collected through multiple sensor
(microphone) arrays; after the embedded signal processing algorithms run,
clearly separated signals appear at the output.
[Figure: audio input → blind source separation algorithms → enhanced output]
2. Problem Statement of Blind Source Separation:
What is “Blind Source Separation”?
[Figure: M source signals arriving at an array of N sensors; each sensor observes a linear mixture of the sources]

Given the N linearly mixed received input signals, we need to recover the M statistically independent sources as well as possible (N ≥ M).
Formulation of Blind Source Separation Problem:
A received signal vector from the array, X(t), is the original source vector S(t)
passed through the channel distortion H(t), such that X(t) = H(t) S(t), where

X(t) = [x_1(t), …, x_N(t)]^T,   S(t) = [s_1(t), …, s_M(t)]^T,

and H(t) = [h_ij(t)] is the N×M matrix of channel responses:

H(t) = | h_11(t) … h_1M(t) |
       |    ⋮          ⋮    |
       | h_N1(t) … h_NM(t) |

We need to estimate a separator W(t) = [w_pq(t)], an N×N matrix,

W(t) = | w_11(t) … w_1N(t) |
       |    ⋮          ⋮    |
       | w_N1(t) … w_NN(t) |

such that

S̃(t) = [s̃_1(t), …, s̃_M(t), 0, …, 0]^T = W(t) X(t).
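The formulation can be exercised on a toy instantaneous (memoryless) mixture. Here W is computed by inverting a known H only to verify the algebra; a blind algorithm must estimate W from X alone, using the independence criteria discussed next:

```python
import numpy as np

rng = np.random.default_rng(3)
# M = 2 statistically independent sources, N = 2 sensors
S = np.vstack([np.sign(rng.standard_normal(1000)),   # binary-valued source
               rng.uniform(-1.0, 1.0, 1000)])        # uniform source
H = np.array([[1.0, 0.6],
              [0.4, 1.0]])     # instantaneous mixing matrix
X = H @ S                       # received signals: X(t) = H S(t)

# Ideal separator: here we cheat and invert the known H; a blind algorithm
# must instead estimate W from X alone via an independence criterion.
W = np.linalg.inv(H)
S_hat = W @ X                   # recovered sources: S~(t) = W X(t)
print(np.allclose(S_hat, S))    # True
```

With convolutive channels H(t), the separator becomes a matrix of filters rather than a constant matrix, but the goal W(t) H(t) ≈ I is the same.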
3. Principles of Blind Source Separation:
The independence measure: Shannon’s mutual information.

I(y_1, …, y_N) = Σ_{i=1}^{N} H(y_i) − H(y_1, …, y_N) ≥ 0

Equivalently, in terms of the joint and marginal densities,

I(y_1, …, y_N) = E{ log[ f_Y(y_1, …, y_N) ] } − Σ_{i=1}^{N} E{ log[ f_{y_i}(y_i) ] },

which vanishes exactly when the outputs y_1, …, y_N are statistically independent.
4. Criteria to Separate Independent Sources:
• Constrained Entropy (Wu, IJCNN99):
  J_1 = −log[ |det(W)| ] − Σ_{i=1}^{N} log[ f_i(y_i, …) ]
• Hadamard Measure (Wu, ICA99):
  J_2 = Σ_i log( [E[Y Y^T]]_ii ) − log det( E[Y Y^T] )
• Frobenius Norm (Wu, NNSP97):
  J_3 = ‖ E[Y Y^T] − diag( E[Y Y^T] ) ‖_F²
• Quadratic Gaussianity (Wu, NNSP99):
  J_4 = Σ_i ∫ [ f_{Y_i}(y_i) − f_G(y_i) ]² dy_i
5. Application to Blind Single Channel Equalization for Digital Communication Systems:

We apply the minimization of the modified constrained entropy

J_1 = −log( w_0 ) − Σ_{i=1}^{N} log[ f_i(y_i, …) ]

to adapt an equalizer w(t) = [w_0, w_1, …] for a digital channel h(t). Assume a
PAM signal constellation with symbols s(t), passing through the digital channel

h(t) = [ c(t, 0.11) + 0.8 c(t−1, 0.11) − 0.4 c(t−3, 0.11) ] W_{6T}(t),

where

c(t, α) = sinc(t/T) · cos(απt/T) / (1 − 4α²t²/T²)

is the raised-cosine function with roll-off factor α, and W_{6T}(t) = rect(t/(6T))
is a rectangular window. The input signal to the equalizer is

x(t) = h(t) * s(t) + n(t),

where n(t) is the background noise. We applied generalized anti-Hebbian
learning to adapt w(t) such that w(t) * h(t) ≈ δ(t).
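The goal condition w(t) * h(t) ≈ δ(t) can be checked numerically. The sketch below uses sample-spaced taps suggested by the slide's h(t) (pulse shaping and windowing omitted) and computes the inverse filter directly in the frequency domain — a non-blind shortcut, whereas the slides adapt w(t) blindly via generalized anti-Hebbian learning without knowing h:

```python
import numpy as np

# Sample-spaced channel taps suggested by the slide's h(t) (the raised-cosine
# pulse shaping and the window are omitted in this simplification)
h = np.array([1.0, 0.8, 0.0, -0.4])

# Non-blind shortcut: invert the channel in the frequency domain and verify
# the equalization goal w(t) * h(t) ~ delta(t).
n = 64
w = np.fft.irfft(1.0 / np.fft.rfft(h, n), n)     # truncated channel inverse
combined = np.convolve(w, h)[:n]                 # w(t) * h(t)
print(round(float(combined[0]), 3))              # ~1.0 (the delta spike)
print(round(float(np.max(np.abs(combined[1:]))), 3))   # ~0.0 (residual ISI)
```

The direct inverse is well behaved here because this channel is minimum-phase; a blind adaptive rule must reach the same combined response using only the equalizer output statistics.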
[Figure: signal-to-interference ratio (dB) vs. signal-to-noise ratio (dB)]
[Figure: bit error rate vs. signal-to-noise ratio (dB)]
6. Simulation and Comparison:
We compare our generalized anti-Hebbian learning against the SDIF
algorithm and Lee’s Infomax method (Lee, IJCNN97) over three real
recordings downloaded from the Salk Institute, University of California
at San Diego.
New VR LITE Front End: Blind Source Separation + End-of-speech Detection

| Scheme | Average Detection Time Error | Average False Detection Time Error | Average Correct Detection Time Error | Number of Strings | False Detection Rate | Total Detection Rate |
|---|---|---|---|---|---|---|
| EOS only | 0.256 seconds | 0.155 seconds | 0.317 seconds | 14 | 7.14% | 42.86% |
| BSS + EOS | 0.236 seconds | 0.125 seconds | 0.322 seconds | 14 | 7.14% | 50.00% |
7. Conclusion and Future Research:
• The computational complexity of blind source separation needs to
be reduced.
• Test BSS for EOS detection with microphone arrays of the same
kind.
• Incorporate other array signal processing techniques (e.g., a beamformer)
to improve speech detection and recognition.