A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

MS Thesis Defense, December 6, 2001. University of Illinois at Chicago, Politecnico di Torino. Daniele Quercia, [email protected]


Page 1: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

MS Thesis Defense

December 6, 2001

University of Illinois at Chicago

Politecnico di Torino

Daniele Quercia

[email protected]

Page 2: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Outline

Distributed Speech Recognition: Our Focus

Experimental framework

Experimental results

Conclusions: the impact of packet losses

Page 3: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Introduction

The Internet has proliferated rapidly

Strong interest in transporting voice over IP networks

Novel Internet applications can benefit from Automatic Speech Recognition (ASR)

Page 4: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Introduction (cont’d)

Desire for speech input for hand-held devices (mobile phones, PDAs, etc.)

Speech recognition requires high computation, RAM, and disk resources

If hand-held devices are connected to a network, speech recognition can take place remotely (e.g. ETSI Aurora Project)

Page 5: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Distributed Speech Recognition Architecture

The speech recognition task is distributed between two end systems:

client side (light-weight)

server side

[Diagram: Client Side, IP Network, Server Side]

Page 6: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Distributed Speech Recognition Architecture (cont’d)

[Diagram: Client Device (Front-end, Packing and Framing), IP Network, Remote site (Unpacking, Recognizer)]

Page 7: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Technical challenges

IP networks are not designed for transmitting real-time traffic

Lack of guarantees in terms of packet losses, network delay, and delay jitter

Page 8: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Technical challenges (cont’d)

Speech packets must be received:

without significant losses

with low delay

with small delay variation (jitter)

The design of Distributed Speech Recognition systems must consider the effect of packet losses

Page 9: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Our Focus

Performance evaluation of a Distributed Speech Recognition (DSR) system operating over simulated IP networks

Page 10: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Our Research Contributions

Evaluation of a standard front-end that achieves state-of-the-art performance

Simulative study of DSR under increasingly realistic scenarios:

Random losses

Gilbert-model losses

Network simulations

Page 11: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Experimental framework

Front-end

Speech Database

Recognizer

Network scenarios

Page 12: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Front-end

Experimental framework | Results | Summary & Conclusions

The front-end extracts from the speech signal the information that is significant for recognition

Spectral Envelope

Page 13: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Front-end (cont’d)

ETSI Aurora Standard Front-end produces 14 coefficients

Front-end based on mel coefficients

Frequency axis is warped closer to perceptual axis

Mel coefficients represent the short-time spectral envelope
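The warping of the frequency axis mentioned above can be illustrated with a widely used mel-scale formula. This is a minimal sketch: the constants 2595 and 700 are the common textbook choice and are an assumption here, since the slide does not state which variant was used, and the function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to the mel scale (common formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, from mel back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal steps in Hz shrink on the mel axis as frequency grows,
# which is what "warped closer to the perceptual axis" refers to.
low_step = hz_to_mel(1000.0) - hz_to_mel(0.0)
high_step = hz_to_mel(4000.0) - hz_to_mel(3000.0)
```

In this sketch `low_step` is larger than `high_step`: the same 1000 Hz span covers fewer mels at high frequencies.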

Page 14: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Recognizer

HMM-based speech recognizer

HMM model consists of:

-states $q_i$

-initial state distribution $\pi_i$

-transition probabilities $a_{ij}$

-emission distribution $b_i(o)$

16 states per word

3 Gaussian mixtures per state

Page 15: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Recognizer (cont’d)

TRAINING PHASE

Training examples for Word 1, Word 2, Word 3

Estimated Models: M 1, M 2, M 3

RECOGNITION PHASE

Unknown observation sequence O

Evaluated probabilities: P(O|M 1), P(O|M 2), P(O|M 3)

Max Probability chosen
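The recognition phase above (evaluate P(O|M) for every word model, then pick the maximum) can be sketched with the forward algorithm. This is a toy illustration under stated assumptions: it uses discrete emission tables rather than the Gaussian mixtures of the actual recognizer, and the two models, their parameters, and all function names are invented for the example.

```python
def forward_likelihood(obs, pi, a, b):
    """Return P(O | M) for a discrete-emission HMM using the forward algorithm.

    pi[i]   : initial state distribution
    a[i][j] : transition probability from state i to state j
    b[i][o] : probability of emitting symbol o in state i
    """
    n = len(pi)
    alpha = [pi[i] * b[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * a[i][j] for i in range(n)) * b[j][o]
                 for j in range(n)]
    return sum(alpha)

def recognize(obs, models):
    """Pick the model name with the highest P(O | M)."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))

# Two hypothetical left-to-right word models over symbols {0, 1}.
transitions = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "word_A": ([1.0, 0.0], transitions, [[0.9, 0.1], [0.2, 0.8]]),
    "word_B": ([1.0, 0.0], transitions, [[0.1, 0.9], [0.8, 0.2]]),
}
best = recognize([0, 0, 1], models)
```

For the observation sequence `[0, 0, 1]` the first model is far more likely, so `best` is `"word_A"`. A practical recognizer would work with log-probabilities to avoid underflow on long utterances.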

Page 16: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Speech Database

ETSI Aurora TIdigits Database 2.0

For training, 8440 utterances selected

For test, 4004 utterances selected without noise added

Page 17: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Network scenarios

Three network scenarios were considered:

Random losses

Gilbert-model losses

Network simulations

1 frame per packet

Page 18: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Random losses

Each packet has the same loss probability

Packet loss ratios: 10% - 40%
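A minimal sketch of the random-loss scenario, assuming independent per-packet losses drawn with Python's standard `random` module. The function name and the fixed seed are illustrative choices, not taken from the thesis.

```python
import random

def random_losses(num_packets, loss_ratio, seed=0):
    """Return a loss trace: True means the packet is lost.

    Every packet is dropped independently with the same probability.
    """
    rng = random.Random(seed)  # seeded so the trace is reproducible
    return [rng.random() < loss_ratio for _ in range(num_packets)]

trace = random_losses(100_000, 0.10)
observed_ratio = sum(trace) / len(trace)
```

With a long trace, `observed_ratio` lands close to the requested 10%.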

Page 19: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Random losses (cont’d)

When the network is congested, a packet loss at time t means a high probability of another packet loss shortly afterwards

[Diagram: time axis marking a packet loss at t and a high loss probability just after t]

Generally, packet losses appear in bursts

Random losses do not model the temporal dependencies of loss

Page 20: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Gilbert-model losses

2-state Markov model:

p = P(next packet lost | previous packet arrived)

q = P(next packet arrived | previous packet lost)

[State diagram: STATE 1 (no loss) stays with probability 1-p and moves to STATE 2 (loss) with probability p; STATE 2 stays with probability 1-q and returns to STATE 1 with probability q]
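The two-state model can be simulated in a few lines. The sketch below relies on two standard properties of this chain: the long-run loss ratio is p / (p + q) and the mean loss-burst length is 1 / q. The function names, the seed, and the example figures (40% loss ratio, mean burst of 4 packets) are illustrative assumptions rather than the thesis's actual code.

```python
import random

def gilbert_params(loss_ratio, mean_burst):
    """Derive (p, q) from a target loss ratio and mean loss-burst length."""
    q = 1.0 / mean_burst                      # mean burst length = 1 / q
    p = loss_ratio * q / (1.0 - loss_ratio)   # from loss_ratio = p / (p + q)
    return p, q

def gilbert_losses(num_packets, p, q, seed=0):
    """Return a loss trace (True = lost) generated by the 2-state chain."""
    rng = random.Random(seed)
    lost = False                 # start in STATE 1 (no loss)
    trace = []
    for _ in range(num_packets):
        if lost:
            lost = rng.random() >= q   # remain in the loss state with prob. 1-q
        else:
            lost = rng.random() < p    # enter the loss state with prob. p
        trace.append(lost)
    return trace

p, q = gilbert_params(0.40, 4)
trace = gilbert_losses(200_000, p, q)
observed_ratio = sum(trace) / len(trace)
```

Here `q` is 0.25 and `p` is 1/6, and `observed_ratio` comes out close to 0.40. Unlike independent random losses, the lost packets cluster into bursts.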

Page 21: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Gilbert-model losses (cont’d)

Simulated packet loss ratios: 10% - 40%

The 2-state Markov model is less accurate than an nth-order Markov model, but it offers a better trade-off between accuracy and complexity.

Documented in the literature: “Gilbert model is a suitable loss model”

Page 22: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Network simulations

Previous models are mathematically simple

Network simulations represent more realistic IP scenarios

Page 23: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Network simulations (cont’d)

VINT Simulation Environment was used

Components:

NS-2 (network simulator version 2)

NAM (network animator)

NS package allows extension by the user

Page 24: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Network simulations (cont’d)

NS simulator

receives a scenario as input

produces trace files

[Diagram: scenario, Network Simulator, trace files]

Page 25: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Network simulations (cont’d)

Our analysis: Scenario in which the users are speaking, while interfering FTP traffic is going on

[Topology diagram: Speech sources and FTP sources on one side, Speech receivers and FTP receivers on the other; link labels: 64 kb/s, 3 ms, 1 ms]

Page 26: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Network simulations (cont’d)

Playout Buffer required to deal with delay variations

[Diagram: three time axes for Sender, IP net, and Receiver, showing the network delay and the buffer size]
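One way to read the playout buffer is as a deadline: a packet that arrives after its send time plus the buffer delay is useless to the recognizer and counts as lost. The sketch below shows that rule under stated assumptions: timestamps are in seconds, `None` marks a packet dropped in the network, and the function name and sample values are hypothetical.

```python
def late_losses(send_times, arrival_times, playout_delay):
    """Mark each packet as lost (True) or usable (False).

    A packet is lost if it never arrived (arrival is None) or if it
    arrived after its playout deadline, send time + playout_delay.
    """
    lost = []
    for sent, arrived in zip(send_times, arrival_times):
        deadline = sent + playout_delay
        lost.append(arrived is None or arrived > deadline)
    return lost

# Four packets sent 10 ms apart, with a 100 ms playout buffer.
send_times = [0.00, 0.01, 0.02, 0.03]
arrival_times = [0.005, 0.150, None, 0.120]
status = late_losses(send_times, arrival_times, playout_delay=0.100)
```

In this example the second packet arrives too late and the third never arrives, so `status` is `[False, True, True, False]`. A larger buffer rescues more late packets at the cost of added end-to-end delay.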

Page 27: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Network simulations (cont’d)

Characteristics of the scenario:

Competing traffic: on/off TCP sources

Playout buffer size: 100 ms

Speech traffic uses the RTP protocol with header compression (8-byte-long packet)

Round-trip time: 10 ms

Simulation: 350 s

Simulated packet loss ratios: 5% - 20%

Page 28: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Performance measures

Word Accuracy is a good measure of performance

Word Accuracy of the baseline system (no errors): 99%

Page 29: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Performance measures (cont’d)

Three kinds of errors:

Reference: I want to go to Venezia

Recognized: - want to go to the Verona

D = Deletion, I = Insertion, S = Substitution

$$\text{Word Accuracy} = 100 \left(1 - \frac{S + D + I}{\#\text{spoken words}}\right)\%$$
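Word Accuracy can be computed by aligning the recognized word string against the reference. The sketch below takes S + D + I to be the minimum word-level edit distance, which is a common way to obtain the error count but an assumption about how the thesis scored its results. The function names are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Minimum number of substitutions, deletions and insertions
    needed to turn the reference word list into the hypothesis."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp_words, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1]

def word_accuracy(reference, recognized):
    """Word Accuracy in percent: 100 * (1 - (S + D + I) / #spoken words)."""
    ref_words = reference.split()
    errors = edit_distance(ref_words, recognized.split())
    return 100.0 * (1.0 - errors / len(ref_words))

accuracy = word_accuracy("I want to go to Venezia", "want to go to the Verona")
```

For the slide's example there is one deletion ("I"), one insertion ("the") and one substitution ("Venezia" to "Verona") over six spoken words, so `accuracy` is 50.0.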

Page 30: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Packet loss concealment

Error concealment technique for packet losses

When packet losses occur, the missing packets are replaced by interpolation
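The slide says missing packets are replaced by interpolation without naming the kind. The sketch below assumes linear interpolation between the last frame received before a gap and the first frame received after it, and repeats the nearest received frame when a gap touches the start or end of the utterance. Both choices, and the function name, are assumptions made for illustration.

```python
def conceal(frames):
    """Fill lost frames (None) in a list of feature vectors.

    Interior gaps are linearly interpolated between their neighbours;
    gaps at either end copy the nearest received frame.
    """
    n = len(frames)
    out = list(frames)
    i = 0
    while i < n:
        if frames[i] is not None:
            i += 1
            continue
        j = i
        while j < n and frames[j] is None:   # find the end of the gap
            j += 1
        left = frames[i - 1] if i > 0 else None
        right = frames[j] if j < n else None
        for k in range(i, j):
            if left is not None and right is not None:
                w = (k - i + 1) / (j - i + 1)  # fraction of the way to `right`
                out[k] = [(1 - w) * a + w * b for a, b in zip(left, right)]
            elif left is not None:
                out[k] = list(left)
            elif right is not None:
                out[k] = list(right)
        i = j
    return out

received = [[0.0, 10.0], None, None, None, [4.0, 2.0]]
repaired = conceal(received)
```

Here three lost frames between `[0.0, 10.0]` and `[4.0, 2.0]` are rebuilt as `[1.0, 8.0]`, `[2.0, 6.0]` and `[3.0, 4.0]`. The longer the gap, the less the straight line resembles the speech that was actually lost.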

Page 31: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Results for random losses

For Packet Loss Ratio=10% and 20%, predominantly single packet losses occur

Overall, 94% of burst lengths < 5 packets.
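Burst-length statistics such as the 94% figure above can be read off a loss trace by measuring runs of consecutive lost packets. This is a minimal sketch with hypothetical function names and a made-up trace. It does not reproduce the thesis's numbers.

```python
def burst_lengths(trace):
    """Return the length of every run of consecutive losses (True values)."""
    bursts = []
    run = 0
    for lost in trace:
        if lost:
            run += 1
        elif run:
            bursts.append(run)
            run = 0
    if run:                      # trace ended in the middle of a burst
        bursts.append(run)
    return bursts

def fraction_shorter_than(bursts, limit):
    """Fraction of bursts whose length is strictly below `limit`."""
    if not bursts:
        return 0.0
    return sum(1 for length in bursts if length < limit) / len(bursts)

trace = [False, True, True, False, True, False, False, True, True, True]
bursts = burst_lengths(trace)
short_share = fraction_shorter_than(bursts, 3)
```

For this trace `bursts` is `[2, 1, 3]`, and two of the three bursts are shorter than 3 packets.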

Page 32: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Results for random losses

As Packet Loss Ratio increases, performance deteriorates

[Figure: Recognition performance with random losses. Word Accuracy (%), axis 75 to 100, versus Packet loss ratio (10, 20, 30, 40), without error concealment and with error concealment]

Recovery from 83% to 99%

Page 33: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Results for Gilbert-model losses

For Packet Loss Ratio=10% and 20%, predominantly single packet losses occur

For Packet Loss Ratio=30% and 40%, burst lengths < 6 packets

Page 34: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Results for Gilbert-model losses

With Packet Loss Ratio=40%, average loss burst length: 4 packets

[Figure: Recognition performance for Gilbert-model losses. Word Accuracy (%), axis 75 to 100, versus Packet loss ratio (10, 20, 30, 40), without Error Concealment and with Error Concealment]

Recovery from 80% to 98%

Page 35: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Results for network simulations

Average burst length: 45 packets. Why?

TCP packets are much larger than speech ones: when speech packets get delayed in the queues, they may reach the receiver too late

Page 36: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Results for network simulations

Loss burst lengths are very large

[Figure: Recognition performance for network simulations. Word Accuracy (%), axis 50 to 100, versus Packet loss ratio (10, 20, 30, 40), without Error Concealment]

With long loss bursts, nothing can be done

Page 37: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Summary and Conclusions

We have analyzed the impact of packet losses on a DSR system over IP networks using the ETSI Aurora database

Packet losses were modeled by:

random losses

Gilbert-model losses

network simulations

Page 38: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Summary and Conclusions (cont’d)

Recognition performance can be anticipated from the length of the loss bursts:

Small burst length losses: good recognition results

Large burst length losses: degraded recognition results

Page 39: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Summary and Conclusions (cont’d)

Single packet losses and short bursts can be tolerated

Bursty packet losses lead to large performance degradation

The error concealment technique provides good results if the error bursts are short (4-5 packets)

Page 40: A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks

Submission

Submitted to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002