Endpoint Detection (EPD)

Jyh-Shing Roger Jang (張智星)
http://mirlab.org/jang
MIR Lab, CSIE Dept., National Taiwan Univ., Taiwan

Intro to Endpoint Detection

Endpoint detection (EPD)
  - Goal: determine the start and end of voice activity
  - Also known as voice activity detection (VAD)
Importance
  - Acts as a preprocessing step for many recognition tasks
  - Should require as little computing power as possible
Two activation modes for speech-based applications
  - Push to talk once: offline EPD
      Example: voice command
  - Push for continuous listening: online EPD
      Example: dictation machine
Quiz candidate!

Types of Features for EPD

Time-domain
  - Volume only
  - Volume and ZCR (zero crossing rate)
  - Volume and HOD (high-order difference)
  - …
Frequency-domain
  - Variance of spectrum
  - Entropy of spectrum
  - MFCC
  - …

Typical Frameworks for EPD

Thresholding
  - Simple thresholding
      Compute a feature (e.g., volume) from each frame
      Select a threshold vth to identify positive frames
  - Combined thresholding
      Use two features (e.g., volume and ZCR) to make the decision
Static classification
  - Take features
  - Perform binary classification
      Negative: silence or noise
      Positive: sound activity
Sequence alignment
  - Use hidden Markov models (HMM) for sequence alignment

Performance Evaluation for EPD

Two types of errors (typical for all binary classification)
  - False negative (aka false rejection): positive → negative
  - False positive (aka false acceptance): negative → positive
Performance evaluation
  - Start & end position accuracy
  - Frame-based accuracy
Quiz candidate!
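
Frame-based evaluation can be sketched as follows. This is a minimal illustration, not the course's own evaluation code: the function name `frame_based_errors` and the label lists are hypothetical, and each frame is assumed to carry a boolean label (True = sound activity).

```python
def frame_based_errors(truth, predicted):
    """Count both error types and the frame-based accuracy.

    truth, predicted: per-frame boolean labels (True = sound activity).
    """
    if len(truth) != len(predicted):
        raise ValueError("label sequences must have the same length")
    # False negative: a positive frame that was classified as negative.
    false_neg = sum(1 for t, p in zip(truth, predicted) if t and not p)
    # False positive: a negative frame that was classified as positive.
    false_pos = sum(1 for t, p in zip(truth, predicted) if not t and p)
    correct = len(truth) - false_neg - false_pos
    accuracy = correct / len(truth) if truth else 0.0
    return {"false_negative": false_neg,
            "false_positive": false_pos,
            "accuracy": accuracy}


# Hypothetical labels for eight frames.
truth     = [False, False, True, True, True, True,  False, False]
predicted = [False, True,  True, True, True, False, False, False]
result = frame_based_errors(truth, predicted)
print(result)
```

Here one frame is falsely accepted and one is falsely rejected, so 6 of 8 frames are correct.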

EPD by Volume Thresholding

The simplest method for EPD
  - The volume of a frame is the sum of the absolute values of its samples.
Four intuitive ways to select vth (α, β, γ, δ are tunable coefficients):
  - vth = vmax * α
  - vth = vmedian * β
  - vth = vmin * γ
  - vth = v1 * δ
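
A minimal sketch of volume thresholding, assuming the threshold is a fraction of the maximum volume. The frame size, hop size, and the 0.1 coefficient are illustrative assumptions, not values from the slides, and the function names are hypothetical.

```python
def frame_volumes(signal, frame_size=256, hop=128):
    """Volume of each frame, computed as the sum of absolute sample values."""
    volumes = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        volumes.append(sum(abs(x) for x in frame))
    return volumes


def positive_frames(volumes, v_th):
    """Mark frames whose volume exceeds the threshold v_th."""
    return [v > v_th for v in volumes]


# Toy signal: silence, a short burst, silence (hypothetical data).
signal = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8
volumes = frame_volumes(signal, frame_size=4, hop=4)
# Threshold as a fraction of the maximum volume; 0.1 is an assumed coefficient.
flags = positive_frames(volumes, v_th=0.1 * max(volumes))
print(volumes, flags)
```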

How Do They Fail?

Unfortunately…
  - All the thresholds fail one way or another.
  - Under what situations do they fail?
Threshold and the situation in which it fails:
  - vth = vmax * α: plosive sounds
  - vth = vmedian * β: silence too long
  - vth = vmin * γ: all-zero frame
  - vth = v1 * δ: unstable first frame
We need a better strategy…

A Better Strategy for Threshold Finding

A presumably better way to select vth
  - vlower = 3rd percentile of volumes
  - vupper = 97th percentile of volumes
  - vth = (vupper - vlower) * r + vlower, where r is a ratio coefficient (its symbol is missing in the source)
Why do we need to use percentiles?
  - To deal with plosive sounds
  - To deal with all-zero frames
Does it fail?
  - Yes, still, in certain situations…
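
The percentile-based threshold can be sketched as below. The nearest-rank percentile definition and the ratio value 0.1 are assumptions made for illustration; the slides do not specify either, and the function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def volume_threshold(volumes, ratio=0.1):
    """vth = (vupper - vlower) * ratio + vlower, using 3rd/97th percentiles."""
    v_lower = percentile(volumes, 3)
    v_upper = percentile(volumes, 97)
    return (v_upper - v_lower) * ratio + v_lower


# Hypothetical volumes: two all-zero frames, ordinary frames, two loud spikes.
volumes = [0, 0] + list(range(10, 106)) + [1000, 2000]
v_lower = percentile(volumes, 3)
v_upper = percentile(volumes, 97)
v_th = volume_threshold(volumes, ratio=0.1)
print(v_lower, v_upper, v_th)
```

Because the percentiles skip the extreme frames, neither the all-zero frames nor the loud spikes move the threshold.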

Example: EPD by Volume

epdByVol01.m
[Figure: two panels. Top: "Waveform and EP (method=vol)", amplitude axis. Bottom: "Volume", volume axis.]

How to Enhance EPD by Volume?

Major problems of EPD by volume
  - Threshold is hard to determine
      Corpus-based fine-tuning
  - Unvoiced parts are likely to be ignored
      We need a feature to enhance the unvoiced parts
      This can be achieved by ZCR or HOD

ZCR for Unvoiced Sound Detection

ZCR: zero crossing rate
  - Number of zero crossings in a frame
  - zvoiced ≤ zsilence ≤ zunvoiced
Example: epdShowZcr01.m
[Figure: two panels over Time (sec). Top: waveform of SingaporeIsAFinePlace.wav, amplitude axis. Bottom: "ZCR", count axis.]
Quiz: If frame = [-1 2 -2 3 5 2 -2 1], what is its ZCR?
Quiz candidate!
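
A minimal sketch of counting zero crossings in a frame. It assumes a crossing is a strict sign change between adjacent samples, so a sample equal to zero is not counted; other conventions exist. The function name and the example frame are hypothetical.

```python
def zero_crossing_count(frame):
    """Number of adjacent sample pairs whose signs differ (zeros not counted)."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)


# Hypothetical frame; the quiz frame can be checked the same way.
frame = [3, -1, -2, 4, 0, 5, -6]
zcr = zero_crossing_count(frame)
print(zcr)
```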

EPD by Volume and ZCR

1. Determine initial endpoints by the upper volume threshold τu
2. Expand the initial endpoints based on the lower volume threshold τl
3. Further expand the endpoints based on the ZCR threshold τzc
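
The three steps above can be sketched as follows for a single sound segment. This is one plausible reading of the procedure, not the code of epdByVolZcr01.m: the function name, the threshold arguments, and the per-frame volume and ZCR lists are all hypothetical.

```python
def epd_vol_zcr(volume, zcr, tau_u, tau_l, tau_zc):
    """Return (start, end) frame indices of one segment, or None if no frame
    exceeds the upper volume threshold."""
    # Step 1: initial endpoints from the upper volume threshold.
    above = [i for i, v in enumerate(volume) if v > tau_u]
    if not above:
        return None
    start, end = above[0], above[-1]
    # Step 2: expand while the volume stays above the lower threshold.
    while start > 0 and volume[start - 1] > tau_l:
        start -= 1
    while end < len(volume) - 1 and volume[end + 1] > tau_l:
        end += 1
    # Step 3: expand further while the ZCR stays above its threshold,
    # which picks up low-volume unvoiced sounds at the boundaries.
    while start > 0 and zcr[start - 1] > tau_zc:
        start -= 1
    while end < len(zcr) - 1 and zcr[end + 1] > tau_zc:
        end += 1
    return start, end


# Hypothetical per-frame features.
volume = [1, 2, 6, 12, 15, 11, 5, 2, 1, 1]
zcr    = [5, 40, 10, 8, 7, 9, 12, 45, 50, 6]
endpoints = epd_vol_zcr(volume, zcr, tau_u=10, tau_l=4, tau_zc=30)
print(endpoints)
```

With these values the segment grows from frames 3..5 (step 1) to 2..6 (step 2) and finally to 1..8 (step 3).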

Example: EPD by Volume and ZCR

epdByVolZcr01.m
[Figure: four panels. "Waveform and EP (method=volZcr)", amplitude axis; "Volume", volume axis; "Zero crossing rate", ZCR axis; "Waveform after EPD", amplitude axis.]

EPD by Volume and HOD

Another feature to enhance unvoiced sounds: high-order difference (HOD)
  - Order-1 HOD = sum(abs(diff(s)))
  - Order-2 HOD = sum(abs(diff(diff(s))))
  - Order-3 HOD = sum(abs(diff(diff(diff(s)))))
  - …
Quiz: If frame = [-1 2 -2 3 -3 2 -2 1], what is its order-1 HOD?
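
A minimal sketch of the order-n HOD in plain Python, mirroring sum(abs(diff(...))) with the difference applied n times. The function names and the example frame are hypothetical.

```python
def diff(seq):
    """First-order difference between adjacent elements."""
    return [b - a for a, b in zip(seq, seq[1:])]


def high_order_difference(frame, order=1):
    """Sum of absolute values after applying diff() `order` times."""
    if order < 1:
        raise ValueError("order must be at least 1")
    d = list(frame)
    for _ in range(order):
        d = diff(d)
    return sum(abs(x) for x in d)


# Hypothetical frame; the quiz frame can be checked the same way.
frame = [1, -1, 2, -2]
hod1 = high_order_difference(frame, order=1)  # diffs: -2, 3, -4
hod2 = high_order_difference(frame, order=2)  # diffs of diffs: 5, -7
print(hod1, hod2)
```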

Example: Plots of Volume and HOD

highOrderDiff01.m
[Figure: two panels over Time (sec). Top: "Waveform", amplitude axis. Bottom: curves for Volume, Order-1 diff, Order-2 diff, Order-3 diff, and Order-4 diff.]

Example: EPD by Vol. and HOD

epdByVolHod01.m
[Figure: three panels. "Waveform and EP (method=volHod)", amplitude axis; "Volume & HOD", with curves for Volume and HOD; "VH", VH axis.]

Hard Example: EPD by Vol. and HOD

A hard example: epdByVolHod02.m
[Figure: three panels. "Waveform and EP (method=volHod)", amplitude axis; "Volume & HOD", with curves for Volume and HOD; "VH", VH axis.]

EPD by Spectrum

epdShowSpec01.m, epdShowSpec02.m
[Figure 1: SingaporeIsAFinePlace.wav. Top: waveform, amplitude vs. Time (sec). Bottom: spectrogram, Freq (Hz) from 0 to 8000.]
[Figure 2: noisy4epd.wav. Top: waveform, amplitude vs. Time (sec). Bottom: spectrogram, Freq (Hz) from 0 to 4000.]

How to Aggregate Spectrum?

How can we aggregate the spectrum into a single feature that is larger (or smaller) when the spectral energy distribution is diversified?
  - Entropy function
  - Geometric mean over arithmetic mean
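
Entropy Function

Entropy function
$$\mathrm{entropy}(\mathbf{p}) = -\sum_{i=1}^{n} p_i \ln p_i, \quad \mathbf{p} = (p_1, p_2, \ldots, p_n), \quad p_i \ge 0 \ \forall i, \quad \sum_{i=1}^{n} p_i = 1$$
Property
$$\mathrm{entropy}(\mathbf{p}) \text{ achieves its maximum when } p_1 = p_2 = \cdots = p_n = 1/n$$
Proof…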
Quiz candidate!
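
The maximum-at-uniform property can be checked numerically. This is a small sketch rather than a proof: the function name and the two example distributions are hypothetical, and the term 0 * ln 0 is taken as 0.

```python
import math


def entropy(p):
    """Entropy -sum(p_i * ln p_i) of a probability vector (0 * ln 0 taken as 0)."""
    if any(x < 0 for x in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be non-negative and sum to 1")
    return -sum(x * math.log(x) for x in p if x > 0)


uniform = entropy([0.25, 0.25, 0.25, 0.25])  # equals ln 4 for n = 4
skewed = entropy([0.7, 0.1, 0.1, 0.1])       # a less diversified distribution
print(uniform, skewed)
```

The uniform vector gives ln n, and any other probability vector of the same length gives a smaller value.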

Plots of Entropy Function

entropyPlot.m
[Figure: plots of the entropy function for N=2 and N=3.]
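
Spectral Entropy

PDF:
$$p_i = \frac{s(f_i)}{\sum_{k=1}^{N} s(f_k)}, \quad i = 1, \ldots, N$$
Normalization:
$$s(f_i) = 0 \ \text{ if } f_i < 250\,\text{Hz or } f_i > 6000\,\text{Hz}$$
$$p_i = 0 \ \text{ if } p_i < \delta_1 \text{ or } p_i > \delta_2$$
(the lower and upper probability thresholds are written here as δ1 and δ2)
Spectral entropy:
$$H = -\sum_{k=1}^{N} p_k \log p_k$$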
Reference: Jialin Shen, Jeihweih Hung, Linshan Lee, “Robust entropy-based endpoint detection for speech recognition in noisy environments”, International Conference on Spoken Language Processing, Sydney, 1998
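
A minimal sketch of spectral entropy for one frame, assuming the frame's spectrum s(f_i) is already available as a list of non-negative values with matching frequencies in Hz. It applies the 250 to 6000 Hz band limit and omits the probability-threshold step, since the threshold values are not given here. The function and variable names are hypothetical.

```python
import math


def spectral_entropy(freqs, spectrum, f_low=250.0, f_high=6000.0):
    """Entropy of the band-limited, normalized spectrum of one frame."""
    # Zero out components outside the band of interest.
    banded = [s if f_low <= f <= f_high else 0.0
              for f, s in zip(freqs, spectrum)]
    total = sum(banded)
    if total <= 0:
        return 0.0  # no in-band energy; defined as 0 in this sketch
    # Normalize to a probability distribution, then take the entropy.
    probs = [s / total for s in banded]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical spectra on eight frequency bins from 500 Hz to 4000 Hz.
freqs = [500.0 * k for k in range(1, 9)]
flat_h = spectral_entropy(freqs, [1.0] * 8)             # evenly spread energy
peaky_h = spectral_entropy(freqs, [100.0] + [1.0] * 7)  # energy in one bin
print(flat_h, peaky_h)
```

The evenly spread spectrum reaches ln 8, the maximum for eight bins, while the concentrated one scores lower.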
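
Geometric/Arithmetic Means

Arithmetic & geometric means
$$am(\mathbf{p}) = \frac{1}{n}\sum_{i=1}^{n} p_i, \quad gm(\mathbf{p}) = \left(\prod_{i=1}^{n} p_i\right)^{1/n}, \quad \mathbf{p} = (p_1, p_2, \ldots, p_n), \quad p_i \ge 0 \ \forall i$$
Property
$$am(\mathbf{p}) \ge gm(\mathbf{p}), \quad \frac{gm(\mathbf{p})}{am(\mathbf{p})} \text{ achieves its maximum when } p_1 = p_2 = \cdots = p_n$$
Proof…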
Quiz candidate!
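
The ratio of geometric mean to arithmetic mean can be sketched as below. The function name and the example vectors are hypothetical; the geometric mean is computed through logarithms to avoid overflow in long products.

```python
import math


def gm_over_am(p):
    """Ratio of geometric mean to arithmetic mean of non-negative values."""
    if not p or any(x < 0 for x in p):
        raise ValueError("p must be non-empty and non-negative")
    am = sum(p) / len(p)
    if am == 0 or any(x == 0 for x in p):
        return 0.0  # the geometric mean is 0 when any element is 0
    gm = math.exp(sum(math.log(x) for x in p) / len(p))
    return gm / am


flat = gm_over_am([2.0, 2.0, 2.0, 2.0])   # all elements equal
peaky = gm_over_am([8.0, 1.0, 1.0, 1.0])  # energy concentrated in one element
print(flat, peaky)
```

The ratio is 1 (up to rounding) when all elements are equal and drops toward 0 as the values become more uneven.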