Electrical and Computer Engineering, Binghamton University, State University of New York
Spectral/Temporal Acoustic Features for Automatic Speech Recognition
Stephen A. Zahorian, Hongbing Hu, Jiang Wu
Department of Electrical and Computer Engineering, Binghamton University
November 16th, 2010
Overview of talk
  Background/Introduction
  Review of traditional spectral/temporal features
  DCTC/DCS features
  Experimental results
  Conclusions
Most Typical Speech Features for ASR
  Spectral Features (Static Features)
    Represent the vocal tract information
    MFCCs (Mel-Frequency Cepstral Coefficients)
  Temporal Features (Dynamic Features)
    Capture time variation (trajectory) of spectral features
    Delta and Delta-Delta terms of MFCCs
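The slide names delta and delta-delta terms without giving a formula. A minimal sketch follows, assuming the widely used regression form d_t = sum_n n (c_{t+n} - c_{t-n}) / (2 sum_n n^2) with edge frames repeated; the function name `delta`, the window of 2 frames, and the toy input are illustrative choices, not taken from the talk.

```python
import numpy as np

def delta(features, window=2):
    """Regression-based delta of a (n_frames, n_coeffs) feature matrix."""
    n_frames = features.shape[0]
    # Repeat the edge frames so every frame has `window` neighbours on each side.
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, window + 1):
        ahead = padded[window + n : window + n + n_frames]
        behind = padded[window - n : window - n + n_frames]
        out += n * (ahead - behind)
    return out / denom

# Toy static track: one coefficient rising by 3 per frame (not real MFCC data).
static = np.arange(10, dtype=float).reshape(-1, 1) * 3.0
d1 = delta(static)      # delta terms
d2 = delta(d1)          # delta-delta terms
features = np.hstack([static, d1, d2])
```

Away from the edges, the delta of a linear ramp equals its slope and the delta-delta is zero, which is a quick sanity check for any implementation of this kind.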
MFCCs (Mel-Frequency Cepstral Coefficients)
  Mel-Frequency Scale
    $$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
  The coefficients $c_i$ are calculated from the log filter-bank amplitudes using the cosine transform
    $$c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\left(\frac{\pi i}{N}(j - 0.5)\right)$$
    $N$: number of filter banks
    $m_j$: log filter-bank amplitudes
  [Figure: Mel scale filter banks (20); x-axis: Frequency (kHz), 0 to 4]
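The two MFCC formulas can be sketched in a few lines, assuming the log filter-bank amplitudes m_j are already available. The function names and the flat test input are illustrative; the filter-bank analysis itself is not shown.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel(f) = 2595 * log10(1 + f / 700), with f in Hz."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def cepstrum_from_log_fbank(log_amps, n_ceps):
    """c_i = sqrt(2/N) * sum_{j=1..N} m_j * cos(pi * i / N * (j - 0.5))."""
    m = np.asarray(log_amps, dtype=float)
    N = m.shape[0]
    j = np.arange(1, N + 1)              # filter-bank channel index, 1..N
    i = np.arange(n_ceps)[:, None]       # cepstral index, 0..n_ceps-1
    basis = np.cos(np.pi * i / N * (j - 0.5))
    return np.sqrt(2.0 / N) * basis @ m

mel_1k = hz_to_mel(1000.0)
# A flat log spectrum over 20 channels: only c_0 should be non-zero.
flat = cepstrum_from_log_fbank(np.ones(20), 13)
```

The constant 2595 places 1000 Hz at roughly 1000 mel, and a flat set of log amplitudes puts all of its energy into c_0, since the higher cosine terms sum to zero over the N channels.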
Speech Recognition Architecture
  Speech Waveform -> Feature Extraction -> Speech Features -> Classification (Recognition) -> Phonemes -> Words
  Classification (Recognition): Recognizer (HMM/NN)
  Example phonemes: i n i: d sil e
  Example words: I need a
Hidden Markov Models (HMMs)
  Speech vectors are generated by a Markov model
  The overall probability is calculated as the product of the transition and output probabilities
  Likelihood can be approximated by only considering the most likely state sequence
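The difference between the full likelihood and the most-likely-path approximation can be illustrated with a toy model. This is a sketch only: it assumes a 3-state left-to-right HMM with discrete output symbols (the recognizer in the talk uses GMM output densities), and every probability value below is made up for illustration.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Total likelihood: sum over all state sequences (forward algorithm)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def viterbi_likelihood(pi, A, B, obs):
    """Likelihood of the single most likely state sequence, and that sequence."""
    delta = pi * B[:, obs[0]]
    back = []
    for o in obs[1:]:
        scores = delta[:, None] * A      # scores[i, j]: best path ending in i, then i -> j
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) * B[:, o]
    state = int(delta.argmax())
    path = [state]
    for ptr in reversed(back):           # trace the best predecessors backwards
        state = int(ptr[state])
        path.append(state)
    return delta.max(), path[::-1]

pi = np.array([1.0, 0.0, 0.0])           # always start in the first state
A = np.array([[0.6, 0.4, 0.0],           # left-to-right transitions, no skips
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1],                # B[state, symbol]: output probabilities
              [0.2, 0.8],
              [0.7, 0.3]])
obs = [0, 1, 1, 0]

total = forward_likelihood(pi, A, B, obs)
best, path = viterbi_likelihood(pi, A, B, obs)
```

Each path probability is a product of transition and output probabilities. The best-path value can never exceed the total, and when one state sequence dominates it is a close lower bound, which is why it is often used in place of the full sum.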
DCTC Features
  Discrete Cosine Transform Coefficients (DCTCs)
  Given the spectrum X with the frequency f normalized to a [0, 1] range, the ith DCTC is calculated:
    $$\mathrm{DCTC}(i) = \int_0^1 a\big(X(g(f))\big)\, \phi_i(f)\, df$$
  Basis vector:
    $$\phi_i(f) = \cos[\pi i\, g(f)]\, \frac{dg}{df}$$
  a(X): nonlinear amplitude scaling (log)
  g(f): nonlinear frequency warping (Mel-like function)
  [Figure: First 3 DCTC basis vectors (BV0, BV1, BV2); x-axis: Frequency [kHz], 0 to 8; y-axis: Amplitude]
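A discretized version of the DCTC integral can be sketched as a matrix product. The assumptions here are not from the slides: the warping g(f) is a normalized Mel-style log curve over an 8 kHz band, the spectrum is sampled on a uniform normalized frequency grid so that the warping enters only through the basis vectors, and the integral is approximated by a Riemann sum. All function names are hypothetical.

```python
import numpy as np

def mel_like_warp(f, f_max_hz=8000.0):
    """Assumed g(f): maps normalized frequency [0, 1] onto [0, 1] along a Mel-style curve."""
    return np.log10(1.0 + f * f_max_hz / 700.0) / np.log10(1.0 + f_max_hz / 700.0)

def dctc_basis(n_terms, n_bins, warp=mel_like_warp):
    """Rows are phi_i(f) = cos(pi * i * g(f)) * dg/df sampled at bin centres."""
    f = (np.arange(n_bins) + 0.5) / n_bins
    g = warp(f)
    dg_df = np.gradient(g, f)            # numerical derivative of the warping
    i = np.arange(n_terms)[:, None]
    return np.cos(np.pi * i * g) * dg_df

def dctc(magnitude_spectrum, n_terms):
    """Riemann-sum approximation of the DCTC integral with log amplitude scaling."""
    X = np.asarray(magnitude_spectrum, dtype=float)
    scaled = np.log(X + 1e-10)           # a(X): log scaling, floored to avoid log(0)
    basis = dctc_basis(n_terms, X.shape[0])
    return basis @ scaled / X.shape[0]

basis = dctc_basis(3, 512)
# Flat spectrum whose log is 1 everywhere: only DCTC(0) should be non-zero.
coeffs = dctc(np.full(512, np.e), 3)
```

Because dg/df is large at low frequencies for a Mel-like curve, the basis vectors weight the low-frequency part of the spectrum more heavily, which gives the non-uniform frequency resolution without an explicit filter bank.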
DCS Features
  Discrete Cosine Series Coefficients (DCSCs)
  Represent the spectral evolution of DCTCs over time and encode the modulation spectrum
    $$\mathrm{DCSC}(i, j) = \int_0^1 \mathrm{DCTC}(i, h(t))\, \theta_j(t)\, dt$$
  Basis vectors:
    $$\theta_j(t) = \cos[\pi j\, h(t)]\, \frac{dh}{dt}$$
  h(t): time “warping” function, giving non-uniform time resolution
  [Figure: First 3 DCSC basis vectors (BV0, BV1, BV2); x-axis: Time [ms], -60 to 60; y-axis: Amplitude]
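The DCS expansion over a block of frames can be sketched the same way as a matrix product. The slides do not define h(t); this sketch assumes a warping whose derivative is a Gaussian bump centred on the block, so that time resolution is highest at the block centre. The `sigma` parameter and both function names are hypothetical.

```python
import numpy as np

def dcs_basis(n_terms, n_frames, sigma=0.25):
    """Rows are theta_j(t) = cos(pi * j * h(t)) * dh/dt over one block."""
    t = (np.arange(n_frames) + 0.5) / n_frames      # normalized time in [0, 1]
    dh_dt = np.exp(-0.5 * ((t - 0.5) / sigma) ** 2)  # assumed shape of dh/dt
    dh_dt /= dh_dt.mean()                            # so that h runs from 0 to 1
    h = (np.cumsum(dh_dt) - 0.5 * dh_dt) / n_frames  # h(t) at frame centres
    j = np.arange(n_terms)[:, None]
    return np.cos(np.pi * j * h) * dh_dt

def dcsc(dctc_block, n_terms):
    """DCSC(i, j) for a block of shape (n_dctc, n_frames); returns (n_dctc, n_terms)."""
    block = np.asarray(dctc_block, dtype=float)
    basis = dcs_basis(n_terms, block.shape[1])
    return block @ basis.T / block.shape[1]

# Two DCTC trajectories that stay constant over a 50-frame block.
block = np.tile([[1.0], [2.0]], (1, 50))
out = dcsc(block, 3)
```

For a trajectory that does not change within the block, only the j = 0 term is non-zero and it equals the constant value; the higher terms respond to movement of the DCTC over time.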
Example
  Original spectrogram, and its rebuilt version with different selections of features.
  [Figure: three spectrogram panels; x-axis: Time (Sec), 0.5 to 1.5; y-axis: Frequency (Hz), 0 to 8000]
    Original spectrogram
    Rebuilt with 13 DCTC and 3 DCS terms
    Rebuilt with 8 DCTC and 5 DCS terms
DCTC/DCS Computation
  [Figure: DCTC/DCS computation from the original spectrogram (x-axis: Time (Sec), y-axis: Frequency (Hz)); diagram labels: Frame Length, Block Length, DCTC 1 to DCTC 5, DCS1 to DCS3, Spectrogram, DCTC/DCS Features]
Experimental Evaluation
  Database
    TIMIT database (“SI” and “SX” only)
    Phoneme: Reduced 48 phone set mapped down from the TIMIT 62 phone set
    Training data: 3696 sentences (462 speakers)
    Testing data: 1344 sentences (168 speakers)
  Recognizer: HMMs
    Left-to-right Markov models with no skip
    48 monophone HMMs are created using the HTK toolkit
    Bigram phone information was used as the language model
  Cambridge University/Microsoft HTK toolkit (Ver3.4)
    Provides powerful tools for data preparation, HMM training and testing, and result analysis
Experimental Evaluation
  TIMIT database
    630 total speakers, 10 sentences each
    462 speakers for training, 168 test speakers
  3 state HMM phone models
  Results given as phone accuracy for 39 “standard” phone categories
  Number of mixtures per state “relatively” high to maximize accuracy
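The slides do not spell out how phone accuracy is computed. A minimal sketch, assuming the usual HTK-style definitions in which accuracy also penalizes insertions while percent correct does not; the error counts below are made up for illustration and are not results from the talk.

```python
def phone_correct(n_ref, substitutions, deletions):
    """Percent correct: hits / reference phones, ignoring insertions."""
    hits = n_ref - substitutions - deletions
    return 100.0 * hits / n_ref

def phone_accuracy(n_ref, substitutions, deletions, insertions):
    """Percent accuracy: (hits - insertions) / reference phones."""
    hits = n_ref - substitutions - deletions
    return 100.0 * (hits - insertions) / n_ref

# Illustrative counts only.
corr = phone_correct(1000, 200, 50)
acc = phone_accuracy(1000, 200, 50, 40)
```

Accuracy is never higher than percent correct, so the two should not be compared directly across papers.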
Evaluation with Static Only Features
  Vary frame length from 5 ms to 30 ms (5 ms as the frame space)
  Vary number of DCTCs (7, 10, 13, 16, 19)
  8 GMM mixtures for each state of HMMs
  [Figure: Static only features; axes: Frame length /ms (5 to 30), Number of used DCTC, Recognition Rate % (48 to 58)]
Evaluation with Dynamic Features
  Use small number of DCTCs (1, 2, 3, or 4), and vary the number of DCSs
  Vary the number of frames per block, so that DCS terms are computed over 50, 100, or 300 ms
  10 ms frame length, 5 ms frame space
  8 GMM mixtures for each HMM state
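The relation between block length and frames per block is simple arithmetic. This sketch assumes a block spans from the start of its first frame to the end of its last frame, so that block length = frame length + (frames - 1) x frame space; the talk does not state its exact convention, and the function name is hypothetical.

```python
def frames_per_block(block_length_ms, frame_length_ms, frame_space_ms):
    """Number of frames whose combined span fits in one block (assumed convention)."""
    if block_length_ms < frame_length_ms:
        raise ValueError("block is shorter than a single frame")
    return int((block_length_ms - frame_length_ms) // frame_space_ms) + 1

# Settings from this slide: 10 ms frames every 5 ms, blocks of 50, 100 or 300 ms.
counts = {b: frames_per_block(b, 10, 5) for b in (50, 100, 300)}
```

Under this convention the three block lengths correspond to 9, 19 and 59 frames per block.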
[Figure: four panels, Dynamic features w/ 1 DCTC, w/ 2 DCTCs, w/ 3 DCTCs, and w/ 4 DCTCs; axes: Block length /ms (100 to 300), Number of DCS (3 to 8), Recognition rate /% (30 to 60)]
Evaluation with Spectral/Temporal Features
  Use 40 features total, and 40 GMM mixtures.
  Vary frame length and the number of frames per block
    2 ms frame space
    8 ms block space
  Vary the combination of different numbers of DCTCs and DCSs, but fix number of parameters to 40
  [Figure: Frame Length Effect (bar chart, y-axis 64 to 72)]
    10 ms: 69.225
    15 ms: 69.0592
    25 ms: 68.4554
  [Figure: Block Length Effect (bar chart, y-axis 64 to 72)]
    50 ms: 68.8663
    100 ms: 70.2012
    300 ms: 67.6721
Evaluation with Spectral/Temporal Features
  Best Result of Each Condition (bar chart, y-axis 64 to 72)
    Condition  DCTCs  DCSs  Best result
    1          8      5     69.69
    2          9      5     70.38
    3          10     4     70.71
    4          11     4     71.19
    5          12     4     71.43
    6          13     4     71.41
    7          14     3     71.11
    8          15     3     71.11
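As a quick check on feature dimensionality, the products of the DCTC and DCS counts can be tabulated. This assumes every DCTC is expanded into the full stated number of DCS terms, which follows the DCSC(i, j) indexing but is not stated for these conditions; how products above 40 relate to the 40-feature total is not given in the slides.

```python
# (number of DCTCs, number of DCSs) for conditions 1 to 8, as listed on the slide.
conditions = {
    1: (8, 5), 2: (9, 5), 3: (10, 4), 4: (11, 4),
    5: (12, 4), 6: (13, 4), 7: (14, 3), 8: (15, 3),
}

# Assumed feature count per condition: one DCSC per (DCTC, DCS) pair.
totals = {c: n_dctc * n_dcs for c, (n_dctc, n_dcs) in conditions.items()}
```

Under this assumption the products range from 40 to 52, so only conditions 1 and 3 give exactly 40.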
Conclusions from these results
  Features which represent trajectories of global spectral shape carry considerable information for ASR.
  There are tradeoffs between “static” spectral features and “dynamic” spectral trajectory features
  Spectral resolution can be relatively low for spectral ASR features
  “Information” in trajectory features is more “dilute” than in spectral features
Questions?