8-Speech Recognition

Speech Recognition Concepts
Speech Recognition Approaches
Recognition Theories
Bayes' Rule
Simple Language Model
P(A|W) Computing Approaches
Network Types

7-Speech Recognition (Cont'd)

HMM Calculating Approaches
Neural Components
Three Basic HMM Problems
Viterbi Algorithm
State Duration Modeling
Training in HMM
Recognition Tasks

Isolated Word Recognition (IWR)
Connected Word (CW) and Continuous Speech Recognition (CSR)
Speaker Dependent, Multiple Speaker, and Speaker Independent

Vocabulary Size:
Small: < 20
Medium: 100 - 1,000
Large: 1,000 - 10,000
Very Large: > 10,000
Speech Recognition Concepts

[Diagram: speech synthesis maps Text → NLP → Phone Sequence → Speech Processing → Speech; speech recognition maps Speech → Speech Processing → NLP → Speech Understanding]

Speech recognition is the inverse of speech synthesis.
Speech Recognition Approaches
Bottom-Up Approach
Top-Down Approach
Blackboard Approach
Bottom-Up Approach
[Diagram: bottom-up pipeline from Signal Processing through Feature Extraction and successive Segmentation stages to the Recognized Utterance, with knowledge sources (Voiced/Unvoiced/Silence decisions, Sound Classification Rules, Phonotactic Rules, Lexical Access, Language Model) applied along the way]
Top-Down Approach
[Diagram: top-down decoding. Feature Analysis feeds a Unit Matching System that uses an inventory of speech recognition units; Lexical, Syntactic, and Semantic Hypotheses draw on a word dictionary, grammar, and task model; an Utterance Verifier/Matcher produces the Recognized Utterance]
Blackboard Approach
[Diagram: Environmental, Acoustic, Lexical, Syntactic, and Semantic processes all communicating through a shared Blackboard]
Recognition Theories
Articulatory-Based Recognition: uses the articulatory system for recognition; this theory has been the most successful so far.
Auditory-Based Recognition: uses the auditory system for recognition.
Hybrid-Based Recognition: a hybrid of the above theories.
Motor Theory: models the intended gesture of the speaker.
Recognition Problem
We have a sequence of acoustic symbols and want to find the words expressed by the speaker.
Solution: find the most probable word sequence given the acoustic symbols.
Recognition Problem
A: acoustic symbols; W: word sequence.
We should find $\hat{W}$ such that

$$P(\hat{W} \mid A) = \max_{W} P(W \mid A)$$
Bayes' Rule
$$P(x \mid y)\,P(y) = P(x, y)$$

$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$$

$$P(W \mid A) = \frac{P(A \mid W)\,P(W)}{P(A)}$$
Bayes' Rule (Cont'd)
$$P(\hat{W} \mid A) = \max_{W} P(W \mid A) = \max_{W} \frac{P(A \mid W)\,P(W)}{P(A)}$$

Since $P(A)$ does not depend on $W$:

$$\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\,P(W)$$
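As an illustration of this decision rule, here is a minimal sketch that picks the word sequence maximizing P(A|W)·P(W) in the log domain; the candidate sequences and all scores are made up for illustration:

```python
# Hypothetical candidates with assumed log-probabilities:
# log P(A|W) would come from an acoustic model, log P(W) from a language model.
candidates = {
    "recognize speech":   {"log_p_a_given_w": -12.1, "log_p_w": -4.2},
    "wreck a nice beach": {"log_p_a_given_w": -11.8, "log_p_w": -7.9},
}

# W_hat = argmax_W P(A|W) * P(W); P(A) is constant over W and can be dropped.
w_hat = max(candidates,
            key=lambda w: candidates[w]["log_p_a_given_w"] + candidates[w]["log_p_w"])
print(w_hat)  # "recognize speech": its combined score (-16.3) beats (-19.7)
```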
Simple Language Model
$$W = w_1 w_2 w_3 \cdots w_n$$

$$P(W) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2)\,P(w_4 \mid w_1, w_2, w_3) \cdots P(w_n \mid w_1, w_2, \ldots, w_{n-1})$$
Computing this probability directly is very difficult and requires a very large database, so Trigram and Bigram models are used instead.
Simple Language Model (Cont’d)
Trigram:
$$P(W) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}, w_{i-2})$$

Bigram:
$$P(W) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$

Monogram:
$$P(W) = \prod_{i=1}^{n} P(w_i)$$
Simple Language Model (Cont’d)
Computing method:
$$P(w_3 \mid w_1, w_2) = \frac{\text{number of occurrences of } w_3 \text{ after } w_1 w_2}{\text{total number of occurrences of } w_1 w_2}$$

Ad hoc method (interpolating the relative frequencies $f$ with weights $\lambda_i$):
$$P(w_3 \mid w_1, w_2) = \lambda_1 f(w_3 \mid w_1, w_2) + \lambda_2 f(w_3 \mid w_2) + \lambda_3 f(w_3)$$
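A minimal sketch of both estimates (the relative-frequency count and the ad hoc interpolation); the toy corpus and the lambda weights are illustrative, not from the slides:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def f_tri(w1, w2, w3):
    # relative frequency: occurrences of w1 w2 w3 / occurrences of w1 w2
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0

def f_bi(w2, w3):
    return bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0

def f_uni(w3):
    return unigrams[w3] / len(corpus)

def p_interp(w1, w2, w3, lambdas=(0.6, 0.3, 0.1)):
    # P(w3|w1,w2) ~ l1*f(w3|w1,w2) + l2*f(w3|w2) + l3*f(w3)
    l1, l2, l3 = lambdas
    return l1 * f_tri(w1, w2, w3) + l2 * f_bi(w2, w3) + l3 * f_uni(w3)

print(p_interp("the", "cat", "sat"))
```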
Error-Producing Factors

Prosody (recognition should be prosody-independent)
Noise (noise should be prevented)
Spontaneous speech
P(A|W) Computing Approaches
Dynamic Time Warping (DTW)
Hidden Markov Model (HMM)
Artificial Neural Network (ANN)
Hybrid Systems
Dynamic Time Warping

[Figures: DTW alignment of test and reference patterns on a local-distance grid]

Search Limitations:
Start and end point constraints
Global limitation
Local limitation

Global limitation: [figure of the allowed global search region]

Local limitation: [figure of the allowed local path transitions]
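A minimal DTW sketch, assuming a Euclidean local distance, the common symmetric local constraint (match/insertion/deletion steps), fixed start and end points, and an optional Sakoe-Chiba band as the global limitation; all names and defaults are illustrative:

```python
import numpy as np

def dtw(x, y, band=None):
    """Align feature sequences x (n frames) and y (m frames).
    band: optional global constraint (max |i - j|, a Sakoe-Chiba band)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue  # outside the global search region
            d = np.linalg.norm(x[i - 1] - y[j - 1])  # local distance
            # local constraint: best of match, insertion, deletion
            D[i, j] = d + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]  # endpoint constraint: paths run from (0,0) to (n,m)

# toy example: two short 1-D "feature" sequences
a = np.array([[0.0], [1.0], [2.0], [1.0]])
b = np.array([[0.0], [1.1], [1.9], [1.2], [1.0]])
print(dtw(a, b, band=2))
```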
Artificial Neural Network
$$y = \varphi\left(\sum_{i=0}^{N-1} w_i x_i\right)$$

[Figure: inputs $x_0, \ldots, x_{N-1}$ weighted by $w_0, \ldots, w_{N-1}$, summed and passed through $\varphi$ to produce $y$]

Simple Computation Element of a Neural Network
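The same computation element in a few lines of Python; the sigmoid is just one illustrative choice for the nonlinearity $\varphi$:

```python
import numpy as np

def neuron(x, w, phi=lambda v: 1.0 / (1.0 + np.exp(-v))):
    # y = phi( sum_{i=0}^{N-1} w_i * x_i )
    return phi(np.dot(w, x))

x = np.array([0.5, -1.0, 2.0])   # inputs x_0 .. x_{N-1}
w = np.array([0.8,  0.2, 0.1])   # weights w_0 .. w_{N-1}
print(neuron(x, w))
```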
Artificial Neural Network (Cont’d)
Neural Network Types:
Perceptron
Time Delay Neural Network (TDNN) computational element
Artificial Neural Network (Cont’d)
[Figure: single-layer perceptron with inputs $x_0, \ldots, x_{N-1}$ and outputs $y_0, \ldots, y_{M-1}$]

Single Layer Perceptron
Artificial Neural Network (Cont’d)
[Figure: three-layer perceptron]

Three Layer Perceptron
2.5.4.2 Neural Network Topologies
TDNN
2.5.4.6 Neural Network Structures for Speech Recognition

[Figures: TDNN architecture and neural network structures used for speech recognition]
Hybrid Methods
Hybrid Neural Network and Matched Filter For Recognition
[Figure: delayed acoustic features of the speech signal feed a neural-network pattern classifier whose output units give the recognition result]
Neural Network Properties
The system is simple, but training requires many iterations.
It does not impose a specific structure.
Despite its simplicity, the results are good.
The training set is large, so training should be done offline.
Accuracy is relatively good.
Pre-processing
Different preprocessing techniques are employed as the front end for speech recognition systems
The choice of preprocessing method is based on the task, the noise level, the modeling tool, etc.
The MFCC Method

The MFCC method is based on how the human ear perceives sounds.
In noisy environments, MFCC performs better than other features.
MFCC was originally introduced for speech recognition applications, but it also gives good performance in speaker recognition.
The Mel, the perceptual hearing unit of the human ear, is obtained from the following relation:

$$\text{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
Steps of the MFCC Method

Step 1: Map the signal from the time domain to the frequency domain using the short-time FFT.

Z(n): speech signal
W(n): window function, e.g., the Hamming window
$W_F = e^{-j2\pi/F}$, $m = 0, \ldots, F-1$
F: length of a speech frame
Steps of the MFCC Method

Step 2: Find the energy of each filter-bank channel.

M: number of filters in the Mel-scale filter bank
$W_k(j),\; k = 0, 1, \ldots, M-1$: the filter functions of the filter bank
[Figure: distribution of the Mel-scale filter bank]
Steps of the MFCC Method

Step 4: Compress the spectrum and apply the DCT to obtain the MFCC coefficients.

In the relation above, $n = 0, \ldots, L$ is the order of the MFCC coefficients.
The Mel-Cepstrum Method

[Block diagram: speech signal → framing → |FFT|² → Mel-scaling → Logarithm → IDCT → low-order coefficients (cepstra, the Mel-cepstrum MFCC coefficients) → differentiator → delta and delta-delta cepstra]
Properties of the Mel-Cepstrum (MFCC)

Maps the Mel filter-bank energies onto the directions of maximum variance (using the DCT).
Speech features become partially independent of one another (an effect of the DCT).
Good performance in clean environments.
Reduced performance in noisy environments.
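Putting the steps together, here is a minimal numpy/scipy sketch of the pipeline above (framing with a Hamming window → |FFT|² → Mel filter bank → logarithm → DCT). All sizes (frame length, step, number of filters M, number of coefficients L) are illustrative defaults, not values from the slides:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    # Mel relation based on human auditory perception
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, step=160, n_fft=512, M=26, L=13):
    # Step 1: short-time FFT of Hamming-windowed frames
    frames = [signal[i:i + frame_len] * np.hamming(frame_len)
              for i in range(0, len(signal) - frame_len, step)]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2          # |FFT|^2

    # Step 2: energy in each of the M Mel-spaced triangular filters
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), M + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((M, n_fft // 2 + 1))
    for k in range(M):
        l, c, r = bins[k], bins[k + 1], bins[k + 2]
        fbank[k, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[k, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    energies = power @ fbank.T

    # Steps 3-4: logarithm, then DCT; keep the first L coefficients
    return dct(np.log(energies + 1e-10), type=2, axis=1, norm='ortho')[:, :L]

coeffs = mfcc(np.random.randn(16000))   # one second of noise as a stand-in
print(coeffs.shape)                      # (number of frames, L)
```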
Time-Frequency Analysis

Short-Time Fourier Transform: the standard way of performing frequency analysis is to decompose the incoming signal into its constituent frequency components.

$$X_m(k) = \sum_{n=0}^{N-1} x(mp + n)\, W(n)\, e^{-j2\pi kn/N}$$

W(n): windowing function; N: frame length; p: step size
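A direct transcription of this definition, with a Hamming window as W(n); the function name and defaults are illustrative:

```python
import numpy as np

def stft(x, N=400, p=160):
    # X_m(k) = sum_n x(mp + n) * W(n) * e^{-j 2*pi*k*n / N}
    W = np.hamming(N)
    frames = [x[m:m + N] * W for m in range(0, len(x) - N, p)]
    return np.fft.fft(frames, axis=-1)   # one row of N frequency components per frame

X = stft(np.random.randn(8000))
print(X.shape)   # (number of frames, N)
```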
Critical band integration
Related to the masking phenomenon: the threshold for hearing a sinusoid is elevated when its frequency is close to the center frequency of a narrow-band noise.
Frequency components within a critical band are not resolved; the auditory system interprets the signals within a critical band as a whole.
Bark scale
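The Bark-scale figure from the slide is not reproduced here. As a reference point, a widely used Hz-to-Bark approximation (due to Zwicker and Terhardt) can be computed as follows; the function name is illustrative:

```python
import numpy as np

def hz_to_bark(f):
    # Zwicker & Terhardt approximation of the Bark critical-band scale
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

print(hz_to_bark(np.array([100.0, 1000.0, 8000.0])))  # critical-band rate in Bark
```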
Feature orthogonalization
Spectral values in adjacent frequency channels are highly correlated
The correlation results in a Gaussian model with many parameters: all the elements of the covariance matrix must be estimated.
Decorrelation is useful to improve the parameter estimation.
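To illustrate why the DCT step helps here, a small numpy sketch on purely synthetic data: adjacent simulated channels share a common component and are therefore highly correlated, and a DCT along the channel axis largely removes that correlation:

```python
import numpy as np
from scipy.fftpack import dct

rng = np.random.default_rng(0)

# simulate log filter-bank energies: a shared component plus small noise,
# which makes neighboring channels strongly correlated
base = rng.standard_normal((1000, 1))
feats = base + 0.1 * rng.standard_normal((1000, 20))

def off_diag_mass(X):
    # mean absolute off-diagonal correlation between feature dimensions
    C = np.corrcoef(X, rowvar=False)
    return np.abs(C - np.diag(np.diag(C))).mean()

print(off_diag_mass(feats))                                     # large: correlated
print(off_diag_mass(dct(feats, type=2, axis=1, norm='ortho')))  # much smaller after DCT
```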
Language Models for LVCSR

$$W = w_1 w_2 \cdots w_Q$$

$$P(W) = P(w_1, w_2, \ldots, w_Q) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_Q \mid w_1, \ldots, w_{Q-1})$$

Word Pair Model: specify which word pairs are valid.

$$P(w_j \mid w_k) = \begin{cases} 1 & \text{if } w_k\, w_j \text{ is valid} \\ 0 & \text{otherwise} \end{cases}$$

Statistical Language Modeling
$$P(W) = \prod_{i} P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-N+1})$$

The N-gram probabilities are estimated from relative frequencies, where $F(\cdot)$ is the count of a word sequence in the training corpus:

$$\hat{P}(w_i \mid w_{i-1}, \ldots, w_{i-N+1}) = \frac{F(w_i, w_{i-1}, \ldots, w_{i-N+1})}{F(w_{i-1}, \ldots, w_{i-N+1})}$$

Trigram:
$$\hat{P}(w_3 \mid w_1, w_2) = \frac{F(w_1, w_2, w_3)}{F(w_1, w_2)}$$

Bigram:
$$\hat{P}(w_2 \mid w_1) = \frac{F(w_1, w_2)}{F(w_1)}$$

Unigram:
$$\hat{P}(w_i) = \frac{F(w_i)}{\sum_j F(w_j)}$$
Perplexity of the Language Model

Entropy of the source:
$$H = -\lim_{Q \to \infty} \frac{1}{Q} \sum_{w_1, \ldots, w_Q} P(w_1, w_2, \ldots, w_Q) \log_2 P(w_1, w_2, \ldots, w_Q)$$

First-order entropy of the source (treating the words as independent, so that $P(w_1, w_2, \ldots, w_Q) = P(w_1) P(w_2) \cdots P(w_Q)$):
$$H_1 = -\sum_{w \in V} P(w) \log_2 P(w)$$

If the source is ergodic, meaning its statistical properties can be completely characterized in a sufficiently long sequence that the source puts out:
$$H = -\lim_{Q \to \infty} \frac{1}{Q} \log_2 P(w_1, w_2, \ldots, w_Q)$$

We often compute H based on a finite but sufficiently large Q:
$$\hat{H} = -\frac{1}{Q} \log_2 P(w_1, w_2, \ldots, w_Q)$$

H is the degree of difficulty that the recognizer encounters, on average, when it has to determine a word from the same source.

If the N-gram language model $P_N(W)$ is used, an estimate of H is:
$$\hat{H}_p = -\frac{1}{Q} \sum_{i=1}^{Q} \log_2 \hat{P}(w_i \mid w_{i-1}, \ldots, w_{i-N+1})$$

Perplexity is defined as:
$$B = 2^{\hat{H}_p} = \hat{P}(w_1, w_2, \ldots, w_Q)^{-1/Q}$$
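As a concrete illustration, a minimal sketch that computes the estimate of H and the perplexity B for a bigram model on a toy corpus; the corpus and the add-one smoothing are illustrative, not from the slides:

```python
import math
from collections import Counter

train = "the cat sat on the mat".split()
test  = "the cat sat".split()

uni = Counter(train)
bi  = Counter(zip(train, train[1:]))
V   = len(uni)

def p_bigram(w_prev, w):
    # add-one smoothing so unseen pairs do not give log(0)
    return (bi[(w_prev, w)] + 1) / (uni[w_prev] + V)

# H_hat = -(1/Q) * sum_i log2 P(w_i | w_{i-1})
Q = len(test) - 1
H = -sum(math.log2(p_bigram(a, b)) for a, b in zip(test, test[1:])) / Q
B = 2 ** H   # perplexity
print(H, B)
```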
Overall recognition system based on subword units