A Simulative Study Of Distributed Speech Recognition Over Internet Protocol Networks


MS Thesis Defense

December 6, 2001

University of Illinois at Chicago

Politecnico di Torino


Daniele Quercia, [email protected]

Outline

Distributed Speech Recognition: Our Focus

Experimental framework

Experimental results

Conclusions: the impact of packet losses

Introduction

The Internet has proliferated rapidly

Strong interest in transporting voice over IP networks

Novel Internet applications can benefit from Automatic Speech Recognition (ASR)

Introduction (cont’d)

Desire for speech input on hand-held devices (mobile phones, PDAs, etc.)

Speech recognition requires substantial computation, RAM, and disk resources

If hand-held devices are connected to a network, speech recognition can take place remotely (e.g. ETSI Aurora Project)

Distributed Speech Recognition Architecture

Speech recognition task distributed between two end systems:

client side (light-weight)

server side

[Diagram: Client Side, IP Network, Server Side]

Distributed Speech Recognition Architecture (cont’d)

[Diagram: Client Device (Front-end, Packing and Framing), IP Network, Remote site (Unpacking, Recognizer)]

Technical challenges

IP networks are not designed for transmitting real-time traffic

Lack of guarantees in terms of packet losses, network delay, and delay jitter

Technical challenges (cont’d)

Speech packets must be received:

with low delay

with small delay variation (jitter)

without significant losses

The design of Distributed Speech Recognition systems must consider the effect of packet losses

Our Focus

Performance evaluation of a Distributed Speech Recognition (DSR) system operating over simulated IP networks

Our Research Contributions

Evaluation of a standard front-end that achieves state-of-the-art performance

Simulative study of DSR under increasingly realistic scenarios:

Random losses

Gilbert-model losses

Network simulations

Experimental framework

Front-end

Speech Database

Recognizer

Network scenarios

Front-end

The front-end extracts from the speech signal the information that is significant for recognition

Spectral Envelope

Front-end (cont’d)

Front-end based on mel coefficients

Mel coefficients represent the short-time spectral envelope

Frequency axis is warped closer to the perceptual axis

ETSI Aurora Standard Front-end produces 14 coefficients
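The frequency warping mentioned above can be illustrated with a minimal sketch. It assumes the widely used mel mapping mel(f) = 2595 log10(1 + f/700); the function names, the 0 to 4000 Hz range and the 10 bands are hypothetical values chosen for illustration and are not taken from the slides.

```python
import math

def hz_to_mel(freq_hz):
    """Map a frequency in Hz to the mel scale (common 2595*log10 form)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping from mel back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(low_hz, high_hz, n_bands):
    """Band edges in Hz, equally spaced on the mel axis between low_hz and high_hz."""
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]

# Illustrative values only: 10 bands between 0 and 4000 Hz.
edges = mel_band_edges(0.0, 4000.0, 10)
widths = [b - a for a, b in zip(edges, edges[1:])]
print([round(e) for e in edges])
```

Equal steps on the mel axis give bands that are narrow at low frequencies and wider at high frequencies, which is the sense in which the axis is "closer to perceptual".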

Recognizer

HMM-based speech recognizer

HMM model consists of:

- states $q_i$

- transition probabilities $a_{ij}$

- emission distribution $b_i(o)$

- initial state distribution $\pi_i$

16 states per word

3 Gaussian mixtures per state

Recognizer (cont’d)

TRAINING PHASE

Training examples: Word 1, Word 2, Word 3

Estimated Models: $M_1$, $M_2$, $M_3$

RECOGNITION PHASE

Unknown observation sequence O

Evaluated probabilities: $P(O \mid M_1)$, $P(O \mid M_2)$, $P(O \mid M_3)$

Max Probability chosen
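The recognition phase (evaluate P(O|M) for every word model, pick the maximum) can be sketched as follows. This is a simplified illustration, not the recognizer used in the study: it assumes discrete observation symbols instead of Gaussian mixtures, and the two toy models `word_a` and `word_b` are hypothetical.

```python
def forward_likelihood(obs, init, trans, emit):
    """P(O | M) for a discrete-observation HMM, computed with the forward algorithm."""
    n = len(init)
    # alpha[i]: probability of the observations so far, ending in state i
    alpha = [init[i] * emit[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
            for j in range(n)
        ]
    return sum(alpha)

def recognize(obs, models):
    """Return the word whose model gives the highest P(O | M)."""
    return max(models, key=lambda w: forward_likelihood(obs, *models[w]))

# Hypothetical 2-state left-to-right models over the symbol alphabet {0, 1}:
# (initial distribution, transition matrix, emission matrix)
models = {
    "word_a": ([1.0, 0.0], [[0.7, 0.3], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]]),
    "word_b": ([1.0, 0.0], [[0.7, 0.3], [0.0, 1.0]], [[0.1, 0.9], [0.8, 0.2]]),
}
print(recognize([0, 0, 1, 1], models))
```

For long utterances a practical implementation would work with log probabilities to avoid underflow; the plain products are kept here for readability.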

Speech Database

ETSI Aurora TIdigits Database 2.0

For training, 8440 utterances selected

For test, 4004 utterances selected, without noise added

Network scenarios

3 network scenarios were considered:

Random losses

Gilbert-model losses

Network simulations

Random losses

Each packet has the same loss probability

1 frame per packet

Packet loss ratios: 10% - 40%
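A minimal sketch of the random-loss scenario, assuming independent (Bernoulli) drops with one frame per packet. The function names, the seed and the packet count are hypothetical; the slides do not describe the actual simulation code.

```python
import random

def random_loss_mask(n_packets, loss_prob, seed=0):
    """True marks a lost packet; each packet is dropped independently."""
    rng = random.Random(seed)
    return [rng.random() < loss_prob for _ in range(n_packets)]

def drop_frames(frames, lost):
    """Replace lost frames with None (one frame per packet)."""
    return [None if is_lost else frame for frame, is_lost in zip(frames, lost)]

mask = random_loss_mask(100_000, 0.10)
observed_ratio = sum(mask) / len(mask)
print(round(observed_ratio, 3))
```

With enough packets the observed ratio settles near the configured loss probability, and because drops are independent, most losses are isolated single packets at low loss ratios.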

Random losses (cont’d)

When the net is congested ...

[Timeline figure: after a packet loss at time t, the probability of another packet loss shortly afterwards is high]

Generally, packet losses appear in bursts

Random losses do not model the temporal dependencies of loss

Gilbert-model losses

2-state Markov model:

p = P(next packet lost | previous packet arrived)

q = P(next packet arrived | previous packet lost)

[State diagram: STATE 1 (no loss) stays in STATE 1 with probability 1-p and moves to STATE 2 with probability p; STATE 2 (loss) stays in STATE 2 with probability 1-q and returns to STATE 1 with probability q]
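The 2-state model can be simulated directly from the definitions of p and q. The sketch below is an illustration under stated assumptions: the chain starts in STATE 1 (no loss), and the values p = 0.1 and q = 0.5 are hypothetical, not parameters reported in the slides.

```python
import random

def gilbert_loss_mask(n_packets, p, q, seed=0):
    """Simulate the 2-state model; True marks a lost packet.

    p = P(next lost | previous arrived), q = P(next arrived | previous lost).
    """
    rng = random.Random(seed)
    lost = False  # assumption: start in STATE 1 (no loss)
    mask = []
    for _ in range(n_packets):
        if lost:
            lost = rng.random() >= q  # stay in the loss state with probability 1-q
        else:
            lost = rng.random() < p   # enter the loss state with probability p
        mask.append(lost)
    return mask

def stationary_loss_ratio(p, q):
    """Long-run fraction of lost packets for the 2-state chain."""
    return p / (p + q)

gilbert_mask = gilbert_loss_mask(200_000, p=0.1, q=0.5)
print(round(sum(gilbert_mask) / len(gilbert_mask), 3),
      round(stationary_loss_ratio(0.1, 0.5), 3))
```

Unlike independent drops, a small q keeps the chain in the loss state for several packets in a row, which is how the model produces bursts.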

Gilbert-model losses (cont’d)

Simulated Packet loss ratios: 10% - 40%

2-state Markov model is less accurate than an nth order Markov model, but offers a better trade-off between accuracy and complexity.

Documented in the literature: “Gilbert model is a suitable loss model”

Network simulations

Previous models are mathematically simple

Network simulations represent more realistic IP scenarios


Network simulations (cont’d)

VINT Simulation Environment was used

Components:

NS-2 (network simulator version 2)

NAM (network animator)

NS package allows extension by user


NS simulator

receives a scenario as input

produces trace files

[Diagram: Network Simulator]
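Trace files can be post-processed to find which speech packets were dropped. The sketch below assumes the classic 12-column NS-2 wired trace layout (event, time, from node, to node, packet type, size, flags, flow id, source, destination, sequence number, packet id). The slides do not say which trace format was used, and the sample lines, the `cbr` label and flow id 1 are hand-written, hypothetical values.

```python
def parse_trace_line(line):
    """Split one trace line into named fields (assumed 12-column wired format)."""
    f = line.split()
    return {
        "event": f[0],        # '+' enqueue, '-' dequeue, 'r' receive, 'd' drop
        "time": float(f[1]),
        "from_node": int(f[2]),
        "to_node": int(f[3]),
        "pkt_type": f[4],
        "size": int(f[5]),
        "flow_id": int(f[7]),
        "seq": int(f[10]),
        "pkt_id": int(f[11]),
    }

def dropped_sequence_numbers(lines, flow_id):
    """Sequence numbers of packets of one flow that were dropped in a queue."""
    records = (parse_trace_line(l) for l in lines if l.strip())
    return sorted(r["seq"] for r in records
                  if r["event"] == "d" and r["flow_id"] == flow_id)

# Hypothetical sample lines, written by hand for illustration.
sample = [
    "+ 1.000 0 2 cbr 8 ------- 1 0.0 3.0 5 40",
    "r 1.004 0 2 cbr 8 ------- 1 0.0 3.0 5 40",
    "d 1.010 2 3 cbr 8 ------- 1 0.0 3.0 6 41",
    "d 1.012 2 3 tcp 1000 ------- 2 1.0 4.0 17 42",
    "d 1.020 2 3 cbr 8 ------- 1 0.0 3.0 7 43",
]
print(dropped_sequence_numbers(sample, flow_id=1))
```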


Our analysis: Scenario in which the users are speaking, while interfering FTP traffic is going on

[Topology diagram: Speech sources and FTP sources on one side, Speech receivers and FTP receivers on the other; link labels: 64 kb/s, 1 ms, 3 ms]


Playout Buffer required to deal with delay variations

[Timing diagram: Sender, IP net (network delay), Buffer (buffer size), Receiver, each plotted against Time]
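A minimal sketch of how a playout buffer turns late arrivals into losses. It assumes a fixed playout delay: each packet is played at its send time plus the playout delay, and a packet that arrives after that deadline is treated as lost. Reading the buffer size as a fixed playout delay is an assumption, and the function name and sample times (in milliseconds) are hypothetical.

```python
def late_or_lost(send_times, arrival_times, playout_delay):
    """Mark packets that are unavailable at playout time.

    Packet i is scheduled for playout at send_times[i] + playout_delay.
    arrival_times[i] is None if the network dropped the packet.
    """
    unavailable = []
    for sent, arrived in zip(send_times, arrival_times):
        deadline = sent + playout_delay
        unavailable.append(arrived is None or arrived > deadline)
    return unavailable

# Times in milliseconds; a 100 ms playout delay absorbs moderate jitter only.
send = [0, 10, 20, 30]
arrive = [5, 130, None, 45]
print(late_or_lost(send, arrive, playout_delay=100))
```

A larger playout delay rescues more late packets but adds end-to-end latency, which is the trade-off behind choosing the buffer size.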


Characteristics of the scenario:

Competing traffic: on/off TCP sources

Speech traffic uses RTP protocol with header compression (8-byte-long packet)

Playout buffer size: 100 ms

Round-trip time: 10 ms

Simulation: 350 s

Simulated Packet loss ratios: 5% - 20%

Performance measures

Word Accuracy is a good measure of performance

Word Accuracy of the baseline system (no errors): 99%

Performance measures (cont’d)

Three kinds of errors:

Reference: I want to go to Venezia

Recognized: - want to go to the Verona

D=Deletion, I=Insertion, S=Substitution

$$\text{Word Accuracy} = 100 \left( 1 - \frac{S + D + I}{\#\,\text{spoken words}} \right) \%$$
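The Word Accuracy formula can be checked on the example sentence pair. The sketch below assumes that S + D + I is taken as the minimum word-level edit distance between reference and recognized strings, which is the usual scoring convention; the slides do not state which scoring tool was used, and the function names are hypothetical.

```python
def edit_errors(reference, recognized):
    """Minimum number of substitutions + deletions + insertions (word level)."""
    n, m = len(reference), len(recognized)
    # dist[i][j]: errors aligning the first i reference words
    # with the first j recognized words
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # i deletions
    for j in range(1, m + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (reference[i - 1] != recognized[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[n][m]

def word_accuracy(reference, recognized):
    """Word Accuracy in percent: 100 * (1 - (S + D + I) / number of spoken words)."""
    ref, hyp = reference.split(), recognized.split()
    return 100.0 * (1.0 - edit_errors(ref, hyp) / len(ref))

print(word_accuracy("I want to go to Venezia", "want to go to the Verona"))  # 50.0
```

In the example there is one deletion ("I"), one insertion ("the") and one substitution ("Venezia" recognized as "Verona"), so three errors over six spoken words give 50%. Because insertions are counted, Word Accuracy can become negative.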

Packet loss concealment

Error concealment technique for packet losses

When packet losses occur, the missing packets are replaced by interpolation
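A minimal sketch of concealment by interpolation, assuming linear interpolation between the nearest received feature vectors on each side of a gap. The slides only say "interpolation", so the linear form, the handling of gaps at the start or end (repeat the nearest received frame) and the function name are assumptions.

```python
def conceal_by_interpolation(frames):
    """Fill None entries by linear interpolation between received neighbours.

    frames: list of feature vectors (lists of floats); None marks a lost frame.
    Gaps at the start or end are filled by repeating the nearest received frame.
    """
    received = [i for i, f in enumerate(frames) if f is not None]
    if not received:
        raise ValueError("no received frames to interpolate from")
    out = list(frames)
    for i, frame in enumerate(frames):
        if frame is not None:
            continue
        prev = max((r for r in received if r < i), default=None)
        nxt = min((r for r in received if r > i), default=None)
        if prev is None:
            out[i] = list(frames[nxt])
        elif nxt is None:
            out[i] = list(frames[prev])
        else:
            w = (i - prev) / (nxt - prev)  # position inside the gap, 0..1
            out[i] = [(1 - w) * a + w * b
                      for a, b in zip(frames[prev], frames[nxt])]
    return out

print(conceal_by_interpolation([[0.0, 2.0], None, [4.0, 6.0]]))
```

The longer the gap, the further the interpolated frames are from any real observation, which is consistent with concealment working well only for short bursts.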

Results for random losses

For Packet Loss Ratio = 10% and 20%, predominantly single packet losses occur

Overall, 94% of burst lengths < 5 packets
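Burst-length statistics such as these can be computed from a per-packet loss mask. The sketch below is an illustration only: the function names and the short hand-written mask are hypothetical and do not reproduce the reported figures.

```python
from collections import Counter

def burst_lengths(lost):
    """Lengths of consecutive runs of lost packets (True = lost)."""
    bursts, run = [], 0
    for is_lost in lost:
        if is_lost:
            run += 1
        elif run:
            bursts.append(run)
            run = 0
    if run:  # a burst that lasts until the end of the trace
        bursts.append(run)
    return bursts

def fraction_shorter_than(bursts, limit):
    """Fraction of bursts whose length is below limit."""
    return sum(b < limit for b in bursts) / len(bursts)

example_mask = [False, True, False, True, True, False, False,
                True, True, True, True, True]
example_bursts = burst_lengths(example_mask)
print(Counter(example_bursts), fraction_shorter_than(example_bursts, 5))
```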

Results for random losses

As Packet Loss Ratio increases, performance deteriorates

[Figure: Recognition performance with random losses. Word Accuracy (%) versus Packet loss ratio (10 to 40), without error concealment and with error concealment]

Recovery from 83% to 99%

Results for Gilbert-model losses

For Packet Loss Ratio = 10% and 20%, predominantly single packet losses occur

For Packet Loss Ratio = 30% and 40%, burst lengths < 6 packets

Results for Gilbert-model losses

With Packet Loss Ratio = 40%,

Average loss burst length: 4 packets
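For a 2-state Markov loss model, the long-run loss ratio is p / (p + q) and the mean loss-burst length is 1 / q, so a target loss ratio and burst length determine p and q. The sketch below applies these relations to the 40% and 4-packet figures as a worked example; the slides do not report the p and q values actually used, and the function name is hypothetical.

```python
def gilbert_parameters(loss_ratio, mean_burst_length):
    """Solve for (p, q) from the stationary loss ratio and the mean burst length.

    Uses loss_ratio = p / (p + q) and mean_burst_length = 1 / q.
    """
    q = 1.0 / mean_burst_length
    p = q * loss_ratio / (1.0 - loss_ratio)
    return p, q

p_40, q_40 = gilbert_parameters(0.40, 4)
print(round(p_40, 3), q_40)
```

Under these assumptions a lost packet is followed by another loss with probability 1 - q = 0.75, while a received packet is followed by a loss with probability of about 0.17.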

[Figure: Recognition performance for Gilbert-model losses. Word Accuracy (%) versus Packet loss ratio (10 to 40), without Error Concealment and with Error Concealment]

Recovery from 80% to 98%

Results for network simulations

Average burst length: 45 packets. Why?

TCP packets are much larger than speech ones: when speech packets get delayed in the queues, they may reach the receiver too late

Results for network simulations

Loss burst lengths are very large

[Figure: Recognition performance for network simulations. Word Accuracy (%) versus Packet loss ratio (10 to 40), without Error Concealment]

With long loss bursts, nothing can be done

Summary and Conclusions

We have analyzed the impact of packet losses on a DSR system over IP networks using the ETSI Aurora database

Packet losses were modeled by:

random losses

Gilbert-model losses

network simulations

Summary and Conclusions (cont’d)

Recognition performance can be predicted from the length of the loss bursts:

Small burst length losses: good recognition results

Large burst length losses: degraded recognition results



Single packet losses and short bursts can be tolerated

Bursty packet losses lead to large performance degradation

Error concealment technique provides good results if the error bursts are short (4-5 packets)



Submission

Submitted to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002
