A Simulative Study of Distributed Speech Recognition over Internet Protocol Networks
MS Thesis Defense
December 6, 2001
University of Illinois at Chicago
Politecnico di Torino
Daniele Quercia
[email protected]

Outline
Distributed Speech Recognition: Our Focus
Experimental framework
Experimental results
Conclusions: the impact of packet losses

Introduction
The Internet has proliferated rapidly
Strong interest in transporting voice over IP networks
Novel Internet applications can benefit from Automatic Speech Recognition (ASR)

Introduction (cont’d)
Desire for speech input on hand-held devices (mobile phones, PDAs, etc.)
Speech recognition requires substantial computation, RAM, and disk resources
If hand-held devices are connected to a network, speech recognition can take place remotely (e.g. ETSI Aurora Project)

Distributed Speech Recognition Architecture
Speech recognition task distributed between two end systems:
- client side (light-weight)
- server side
[Diagram: Client Side -> IP Network -> Server Side]

Distributed Speech Recognition Architecture (cont’d)
[Diagram: Client Device (Front-end, Packing and Framing) -> IP Network -> Remote site (Unpacking, Recognizer)]

Technical challenges
IP networks are not designed for transmitting real-time traffic
Lack of guarantees in terms of packet losses, network delay, and delay jitter

Technical challenges (cont’d)
Speech packets must be received:
- with low delay
- with small delay variation (jitter)
- without significant losses
The design of Distributed Speech Recognition systems must consider the effect of packet losses

Our Focus
Performance evaluation of a Distributed Speech Recognition (DSR) system operating over simulated IP networks

Our Research Contributions
Evaluation of a standard front-end that achieves state-of-the-art performance
Simulative study of DSR under increasingly realistic scenarios:
- Random losses
- Gilbert-model losses
- Network simulations

Experimental framework
Front-end
Speech Database
Recognizer
Network scenarios

Front-end
Experimental framework / Results / Summary & Conclusions
The front-end extracts from the speech signal the information significant for recognition
Spectral envelope

Front-end (cont’d)
ETSI Aurora Standard Front-end produces 14 coefficients
Front-end based on mel coefficients
Mel coefficients represent the short-time spectral envelope
Frequency axis is warped closer to the perceptual axis

Recognizer
HMM-based speech recognizer
HMM model consists of:
- states q_i
- initial state distribution π_i
- transition probabilities a_ij
- emission distribution b_i(o)
16 states per word
3 Gaussian mixtures per state

Recognizer (cont’d)
[Diagram]
TRAINING PHASE: training examples for Word 1, Word 2, Word 3 -> estimated models M1, M2, M3
RECOGNITION PHASE: unknown observation sequence O -> evaluated probabilities P(O|M1), P(O|M2), P(O|M3) -> max probability chosen
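
The recognition phase (evaluate P(O|M) for each word model, then pick the maximum) can be sketched with the forward algorithm. This is a minimal illustration under stated assumptions, not the recognizer used in the thesis: emissions are discrete symbols rather than the Gaussian mixtures named on the slide, the two toy models have 2 states instead of 16, and the names `forward_likelihood`, `recognize`, `word_a`, and `word_b` are hypothetical.

```python
def forward_likelihood(obs, init, trans, emit):
    """P(O | M) for a discrete-emission HMM, computed by the forward algorithm.

    obs   : list of observed symbol indices
    init  : init[i]     = initial probability of state i
    trans : trans[i][j] = transition probability a_ij
    emit  : emit[i][o]  = emission probability b_i(o)
    """
    n = len(init)
    # alpha[i] = P(o_1 .. o_t, state at time t = i)
    alpha = [init[i] * emit[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
            for j in range(n)
        ]
    return sum(alpha)


def recognize(obs, models):
    """Score every model and return (best word, all scores)."""
    scores = {word: forward_likelihood(obs, *m) for word, m in models.items()}
    return max(scores, key=scores.get), scores


# Toy left-to-right models (assumed values, for illustration only).
left_to_right = [[0.7, 0.3], [0.0, 1.0]]
models = {
    "word_a": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.2, 0.8]]),
    "word_b": ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.8, 0.2]]),
}

best, scores = recognize([0, 0, 1, 1], models)
print(best, scores)
```

For long observation sequences a practical implementation would work with log probabilities or scaled alphas to avoid numerical underflow.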

Speech Database
ETSI Aurora TIdigits Database 2.0
For training, 8440 utterances selected
For test, 4004 utterances selected, without noise added

Network scenarios
3 network scenarios were considered:
- Random losses
- Gilbert-model losses
- Network simulations

Random losses
Each packet has the same loss probability
1 frame per packet
Packet loss ratios: 10% - 40%
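
The random-loss scenario amounts to an independent coin flip per packet. The sketch below is an illustration under assumptions: the function names, the trace length of 100,000 packets, and the fixed seed are not from the slides. It generates a loss trace for each loss ratio and measures the lengths of the resulting loss bursts.

```python
import random


def simulate_random_losses(num_packets, loss_prob, seed=0):
    """Return a list of booleans, True where a packet is lost.

    Every packet is lost independently with the same probability.
    """
    rng = random.Random(seed)
    return [rng.random() < loss_prob for _ in range(num_packets)]


def burst_lengths(lost):
    """Lengths of the runs of consecutive lost packets in a loss trace."""
    bursts, run = [], 0
    for is_lost in lost:
        if is_lost:
            run += 1
        elif run:
            bursts.append(run)
            run = 0
    if run:
        bursts.append(run)
    return bursts


for ratio in (0.10, 0.20, 0.30, 0.40):
    lost = simulate_random_losses(100_000, ratio)
    bursts = burst_lengths(lost)
    measured = sum(lost) / len(lost)
    mean_burst = sum(bursts) / len(bursts)
    print(f"target {ratio:.0%}  measured {measured:.1%}  mean burst {mean_burst:.2f}")
```

With independent losses the burst length is geometrically distributed, so short bursts dominate even at high loss ratios.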

Random losses (cont’d)
Generally, packet losses appear in bursts
When the network is congested, a packet loss at time t implies a high probability of another packet loss shortly afterwards
[Diagram: time axis with a packet loss at time t, followed by a high loss probability just after t]
Random losses do not model the temporal dependencies of loss

Gilbert-model losses
2-state Markov model:
- p = P(next packet lost | previous packet arrived)
- q = P(next packet arrived | previous packet lost)
[Diagram: STATE 1 (no loss) and STATE 2 (loss); transition 1 -> 2 with probability p, transition 2 -> 1 with probability q, self-loops with probabilities 1-p and 1-q]
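
The two-state chain can be simulated directly from p and q. The sketch below is illustrative: the function names are hypothetical, and deriving p and q from a target loss ratio and mean burst length is an assumption, since the slides do not say how the parameters were chosen. It relies on two standard properties of this chain: the stationary loss probability is p / (p + q), and the mean loss burst length is 1 / q.

```python
import random


def gilbert_params(loss_ratio, mean_burst_len):
    """Derive (p, q) from a target loss ratio and mean loss burst length.

    Uses loss_ratio = p / (p + q) and mean_burst_len = 1 / q.
    """
    q = 1.0 / mean_burst_len
    p = q * loss_ratio / (1.0 - loss_ratio)
    return p, q


def simulate_gilbert(num_packets, p, q, seed=0):
    """Return a loss trace (True = lost) from the 2-state Markov model."""
    rng = random.Random(seed)
    lost = False  # start in STATE 1 (no loss)
    trace = []
    for _ in range(num_packets):
        if lost:
            # stay in the loss state with probability 1 - q
            lost = rng.random() >= q
        else:
            # enter the loss state with probability p
            lost = rng.random() < p
        trace.append(lost)
    return trace


# Example values (assumed): 40% loss ratio with a mean burst of 4 packets.
p, q = gilbert_params(0.40, 4.0)
trace = simulate_gilbert(200_000, p, q)
print(f"p={p:.4f} q={q:.4f} measured loss ratio={sum(trace) / len(trace):.3f}")
```

Setting q = 1 - p removes the memory and reduces the model to the random-loss case.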

Gilbert-model losses (cont’d)
Simulated packet loss ratios: 10% - 40%
A 2-state Markov model is less accurate than an nth-order Markov model, but offers a better trade-off between accuracy and complexity
Documented in the literature: “Gilbert model is a suitable loss model”

Network simulations
Previous models are mathematically simple
Network simulations represent more realistic IP scenarios

Network simulations (cont’d)
VINT Simulation Environment was used
Components:
- NS-2 (network simulator version 2)
- NAM (network animator)
NS package allows extension by the user

Network simulations (cont’d)
NS simulator:
- receives a scenario as input
- produces trace files
[Diagram: Network Simulator]

Network simulations (cont’d)
Our analysis: scenario in which the users are speaking while interfering FTP traffic is going on
[Diagram: speech sources, FTP sources, speech receivers, FTP receivers; labels: 64 kb/s, 3 ms, 1 ms]

Network simulations (cont’d)
Playout buffer required to deal with delay variations
[Diagram: timelines for Sender, Receiver, and Buffer across the IP net, showing network delay and buffer size]
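
A playout buffer turns excessive delay into loss: a packet that arrives after its playout deadline is useless to the receiver. The sketch below illustrates that rule under assumptions. It uses a fixed playout delay per packet, the function name and sample delays are hypothetical, and `None` marks a packet dropped inside the network.

```python
def late_or_lost(send_times_ms, delays_ms, playout_delay_ms):
    """Flag packets that are unavailable at their playout time.

    A packet sent at time s is played out at s + playout_delay_ms.
    It counts as lost if it was dropped (delay is None) or if it
    arrives after that deadline.
    """
    flags = []
    for send, delay in zip(send_times_ms, delays_ms):
        if delay is None:
            flags.append(True)
            continue
        arrival = send + delay
        deadline = send + playout_delay_ms
        flags.append(arrival > deadline)
    return flags


# One packet every 10 ms; delays are made-up sample values.
send_times = [0, 10, 20, 30, 40, 50]
delays = [12, 40, 130, None, 95, 250]
print(late_or_lost(send_times, delays, playout_delay_ms=100))
```

A larger playout delay rescues more late packets at the cost of a longer end-to-end latency.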

Network simulations (cont’d)
Characteristics of the scenario:
- Competing traffic: on/off TCP sources
- Playout buffer size: 100 ms
- Speech traffic uses RTP protocol with header compression (8-byte-long packet)
- Round-trip time: 10 ms
- Simulation: 350 s
- Simulated packet loss ratios: 5% - 20%

Performance measures
Word Accuracy is a good measure of performance
Word Accuracy of the baseline system (no errors): 99%

Performance measures (cont’d)
Three kinds of errors: D = Deletion, I = Insertion, S = Substitution
Reference: I want to go to Venezia
Recognized: -- want to go to the Verona
$$\text{Word Accuracy} = 100 \left(1 - \frac{S + D + I}{\#\,\text{spoken words}}\right)\,\%$$
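
The word accuracy measure can be computed by aligning the recognized word sequence with the reference and counting the three error types. The sketch below is an illustration, not the scoring tool used in the thesis: it assumes a plain minimum-edit-distance alignment with unit costs, and the function names are hypothetical.

```python
def count_errors(reference, recognized):
    """Return (S, D, I) from a minimum-edit-distance word alignment."""
    n, m = len(reference), len(recognized)
    # cost[i][j] = (total errors, S, D, I) for reference[:i] vs recognized[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, ins = cost[i - 1][j - 1]
            if reference[i - 1] == recognized[j - 1]:
                best = (t, s, d, ins)          # match, no error
            else:
                best = (t + 1, s + 1, d, ins)  # substitution
            t, s, d, ins = cost[i - 1][j]
            best = min(best, (t + 1, s, d + 1, ins))  # deletion
            t, s, d, ins = cost[i][j - 1]
            best = min(best, (t + 1, s, d, ins + 1))  # insertion
            cost[i][j] = best
    _, s, d, ins = cost[n][m]
    return s, d, ins


def word_accuracy(reference, recognized):
    """Word accuracy in percent: 100 * (1 - (S + D + I) / #spoken words)."""
    s, d, ins = count_errors(reference, recognized)
    return 100.0 * (1 - (s + d + ins) / len(reference))


ref = "I want to go to Venezia".split()
rec = "want to go to the Verona".split()
print(count_errors(ref, rec))   # (1, 1, 1)
print(word_accuracy(ref, rec))  # 50.0
```

For the slide's example the alignment finds one deletion ("I"), one insertion ("the"), and one substitution ("Venezia" recognized as "Verona"), so 3 errors over 6 spoken words give a word accuracy of 50%.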

Packet loss concealment
Error concealment technique for packet losses
When packet losses occur, the missing packets are replaced by interpolation
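
Concealment by interpolation can be sketched as follows. The slides only say that missing packets are replaced by interpolation, so the details here are assumptions: linear interpolation between the last frame received before a burst and the first frame received after it, and repetition of the nearest received frame when a burst touches the start or end of the utterance. The function name is hypothetical.

```python
def conceal_by_interpolation(frames):
    """Fill lost frames (None) by linear interpolation between neighbors.

    frames: list of feature vectors (lists of floats), None where lost.
    Returns a new list with every lost frame replaced.
    """
    out = list(frames)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < n and out[i] is None:
            i += 1
        end = i  # the loss burst covers indices [start, end)
        before = out[start - 1] if start > 0 else None
        after = out[end] if end < n else None
        if before is None and after is None:
            raise ValueError("all frames lost; nothing to interpolate from")
        for k in range(start, end):
            if before is None:
                out[k] = list(after)    # burst at the start: repeat next frame
            elif after is None:
                out[k] = list(before)   # burst at the end: repeat last frame
            else:
                w = (k - start + 1) / (end - start + 1)
                out[k] = [(1 - w) * b + w * a for b, a in zip(before, after)]
    return out


received = [[0.0, 10.0], None, None, None, [4.0, 2.0]]
print(conceal_by_interpolation(received))
```

The longer the burst, the further the interpolated frames are from any real observation, which is consistent with concealment helping mainly for short bursts.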

Results for random losses
For Packet Loss Ratio = 10% and 20%, predominantly single packet losses occur
Overall, 94% of burst lengths < 5 packets

Results for random losses
As Packet Loss Ratio increases, performance deteriorates
[Figure: Recognition performance with random losses. Word Accuracy (%) vs. packet loss ratio (10 to 40), without and with error concealment. Annotation: recovery from 83% to 99%]

Results for Gilbert-model losses
For Packet Loss Ratio = 10% and 20%, predominantly single packet losses occur
For Packet Loss Ratio = 30% and 40%, burst lengths < 6 packets

Results for Gilbert-model losses
With Packet Loss Ratio = 40%, average loss burst length: 4 packets
[Figure: Recognition performance for Gilbert-model losses. Word Accuracy (%) vs. packet loss ratio (10 to 40), without and with error concealment. Annotation: recovery from 80% to 98%]

Results for network simulations
Average burst length: 45 packets. Why?
TCP packets are much larger than speech ones: when speech packets get delayed in the queues, they may reach the receiver too late

Results for network simulations
Loss burst lengths are very large
[Figure: Recognition performance for network simulations. Word Accuracy (%) vs. packet loss ratio (axis ticks 10 to 40), without error concealment. Annotation: with long loss bursts, nothing can be done]

Summary and Conclusions
We have analyzed the impact of packet losses on a DSR system over IP networks using the ETSI Aurora database
Packet losses were modeled by:
- random losses
- Gilbert-model losses
- network simulations

Summary and Conclusions (cont’d)
Recognition performance can be anticipated from the length of the loss bursts:
- Short loss bursts: good recognition results
- Long loss bursts: degraded recognition results

Summary and Conclusions (cont’d)
Single packet losses and short bursts can be tolerated
Bursty packet losses lead to large performance degradation
Error concealment technique provides good results if the error bursts are short (4-5 packets)

Submission
Submitted to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002