
Page 1

From Recognition To Understanding: Expanding the Traditional Scope of Signal Processing

Li Deng
Microsoft Research, Redmond, WA
Presented at the Banff Workshop, July 2009

Page 2

Outline

• Traditional scope of signal processing: the “signal” dimension and the “processing/task” dimension

• Expansion along both dimensions
  – the “signal” dimension
  – the “task” dimension

• Case study on the “task” dimension: from speech recognition to speech understanding

• Three benefits for MMSP research

Page 3

Signal Processing Constitution

• “… The Field of Interest of the Society shall be the theory and application of filtering, coding, transmitting, estimating, detecting, analyzing, recognizing, synthesizing, recording, and reproducing signals by digital or analog devices or techniques. The term ‘signal’ includes audio, video, speech, image, communication, geophysical, sonar, radar, medical, musical, and other signals…” (ARTICLE II)

• Translate to a “matrix”: “Processing type” (row) vs. “Signal type” (column)

Page 4

Scope of SP in a matrix

| Tasks/Apps \ Media type | Audio/Music | Speech | Image/Animation/Graphics | Video | Text/Document/Language(s) |
|---|---|---|---|---|---|
| Coding | Audio Coding | Speech Coding | Image Coding | Video Coding | Document Compression/Summary |
| Communication (transmit/estim/detect) | | | | | |
| Record/Reproducing | Microphone/loudspeaker design | | Camera | | |
| Analysis (filtering, enhancement) | De-noising/Source separation | Speech Enhancement/Feature extraction | Image/video enhancement (e.g., ClearType), Segmentation, Feature extraction (e.g., SIFT) | | Grammar checking, Text parsing |
| Synthesis | Computer Music | Speech Synthesis (text-to-speech) | Computer Graphics | Video Synthesis? | Natural Language Generation |
| Recognition | Auditory Scene Analysis? | Automatic Speech/Speaker Recognition | Image Recognition (e.g., optical character recognition, face recognition, fingerprint recognition) | Computer Vision (e.g., 3-D object recognition) | Text Categorization |
| Understanding (Semantic IE) | | Spoken Language Understanding (e.g., voice search) | Image Understanding (e.g., scene analysis) | | Natural Language Understanding/MT |
| Retrieval/Mining | Music Retrieval | Spoken Document Retrieval & Voice/Mobile Search | Image Retrieval | Video Search | Text Search (info retrieval) |
| Social Media Apps | Zune, iTunes, etc. | Podcasts | Photo Sharing (e.g., Flickr) | Video Sharing (e.g., YouTube, 3D Second Life) | Blogs, Wiki, del.icio.us… |

Page 5

Scope of SP in a matrix (expanded)

| Tasks/Apps \ Media type | Audio/Music/Acoustics | Speech | Image/Animation/Graphics | Video | Text/Document/Language(s) |
|---|---|---|---|---|---|
| Coding/Compression | Audio Coding | Speech Coding | Image Coding | Video Coding | Document Compression/Summary |
| Communication | MIMO; Voice over IP; DAB/DVB; IP-TV; Home Network; Wireless? (spanning media types) | | | | |
| Security/Forensics | Multimedia watermarking, encryption, etc. (spanning media types) | | | | |
| Enhancement/Analysis | De-noising/Source separation | Speech Enhancement/Feature extraction | Image/video enhancement, Segmentation, Feature extraction (e.g., SIFT, SURF), Computational photography | | Grammar checking, Text parsing |
| Synthesis/Rendering | Computer Music | Speech Synthesis (text-to-speech) | Computer Graphics | Video Synthesis | Natural Language Generation |
| User Interface | Multi-modal Human-Computer Interaction (HCI: input methods)/Dialog? (spanning media types) | | | | |
| Recognition/Verification-Detection | Auditory Scene Analysis; Machine hearing? (computer audition; e.g., melody detection & singer ID, etc.) | Automatic Speech/Speaker Recognition | Image Recognition (e.g., optical character recognition, face recognition, fingerprint recognition) | Computer Vision (e.g., 3-D object recognition; “story telling” from video, etc.) | Text Categorization |
| Understanding (Semantic IE) | | Spoken Language Understanding (e.g., HMIHY) | Image Understanding (e.g., scene analysis)? | | Natural Language Understanding/MT |
| Retrieval/Mining | Music Retrieval | Spoken Document Retrieval & Voice/Mobile Search | Image Retrieval (CBIR) | Video Search | Text Search (info retrieval) |
| Social Media Apps | iTunes, etc. | Podcasts | Photo Sharing (e.g., Flickr) | Video Sharing (e.g., YouTube, 3D Second Life) | Blogs, Wiki, del.icio.us… |

Page 6

Speech Understanding: Case Study (Yaman, Deng, Yu, Acero: IEEE Trans. ASLP, 2008)

• Speech understanding: not to get “words” but to get “meaning/semantics” (actionable by the system)

• Speech utterance classification as a simple form of speech “understanding”

• Case study: ATIS domain (Airline Travel Info System)

• “Understanding”: does the user want to book a flight, or get info about ground transportation in Seattle (SEA)?

Page 7

Traditional Approach to Speech Understanding/Classification

Block diagram: $X_r$ → Automatic Speech Recognizer (Acoustic Model, Language Model) → $\hat{W}_r$ → Semantic Classifier (Classifier Model, Feature Functions) → $\hat{C}_r$

Find the most likely semantic class for the $r$-th acoustic signal:

$$\hat{C}_r = \arg\max_{C_r} P(C_r \mid X_r)$$

1st stage (speech recognition):

$$\hat{W}_r = \arg\max_{W_r} P(W_r \mid X_r)$$

2nd stage (semantic classification):

$$\hat{C}_r = \arg\max_{C_r} P(C_r \mid \hat{W}_r)$$
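The two-stage decision rule can be sketched with toy numbers. All scores below are hypothetical stand-ins for a real recognizer and classifier; only the argmax structure follows the slide:

```python
# Hypothetical P(W | X_r): posterior of each candidate transcription.
asr_posteriors = {
    "book a flight to seattle": 0.7,
    "look a flight to seattle": 0.3,
}
# Hypothetical P(C | W): semantic-class posteriors given a transcription.
class_posteriors = {
    "book a flight to seattle": {"FLIGHT": 0.9, "GROUND": 0.1},
    "look a flight to seattle": {"FLIGHT": 0.4, "GROUND": 0.6},
}

# 1st stage: W_hat = argmax_W P(W | X_r)  (word errors minimized here).
w_hat = max(asr_posteriors, key=asr_posteriors.get)

# 2nd stage: C_hat = argmax_C P(C | W_hat)  (class decided from one-best only).
c_hat = max(class_posteriors[w_hat], key=class_posteriors[w_hat].get)
```

Note that the second stage never sees the alternative transcription, which is exactly the limitation the integrated design addresses.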

Page 8

Traditional/New Approach

• Word error rate is minimized in the 1st stage.
• Understanding error rate is minimized in the 2nd stage.
• Lower word error does not necessarily mean better understanding.

• The new approach: integrate the two stages so that the overall “understanding” errors are minimized.

Page 9

New Approach: Integrated Design

Key components:
• Discriminative training
• N-best list rescoring
• Iterative update of parameters

Block diagram: $X_r$ → Automatic Speech Recognizer (Acoustic Model, Language Model) → N-best list $\{W_r^0, \ldots, W_r^N\}$ → N-best list rescoring using $D(C_r, W_r; X_r)$ → Semantic Classifier & LM Training (Classifier Model, Feature Functions) → $\hat{C}_r$

Page 10

Classification Decision Rule using N-Best List

$$\hat{C} = \arg\max_{C} \sum_{W} P(C \mid W, X)\, P(X \mid W)\, P(W) \qquad \text{(sum over all possible } W\text{)}$$

$$\hat{C} = \arg\max_{C} \max_{W \in N\text{-best list}} P(C \mid W)\, P(X \mid W)\, P(W) \qquad \text{(maximize over } W \text{ in the N-best list)}$$

Integrative score:

$$D(C, W, X_r) = \log\big[P(C \mid W)\, P(X_r \mid W)^{1/L_r}\, P(W)\big]$$

Approximating the classification decision rule $\hat{C} = \arg\max_{C} P(C \mid X_r)$:

$$\hat{C} = \arg\max_{C} \max_{W \in N\text{-best list}} D(C, W, X_r)$$
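The integrative score and its N-best argmax can be sketched as follows. The N-best entries and all log-probabilities are hypothetical, and the $1/L_r$ exponent is read here as length normalization of the acoustic score:

```python
import math

def integrative_score(log_p_c_w, log_p_x_w, log_p_w, length):
    # D(C, W, X_r) = log[ P(C|W) * P(X_r|W)^(1/L_r) * P(W) ]
    return log_p_c_w + log_p_x_w / length + log_p_w

def classify_nbest(nbest, classes):
    # C_hat = argmax_C max_{W in N-best} D(C, W, X_r)
    best_class, best_score = None, float("-inf")
    for c in classes:
        for hyp in nbest:
            d = integrative_score(hyp["log_p_c_w"][c], hyp["log_p_x_w"],
                                  hyp["log_p_w"], hyp["length"])
            if d > best_score:
                best_score, best_class = d, c
    return best_class

# A hypothetical 2-best list for one utterance.
nbest = [
    {"log_p_c_w": {"FLIGHT": math.log(0.9), "GROUND": math.log(0.1)},
     "log_p_x_w": -10.0, "log_p_w": -5.0, "length": 5},
    {"log_p_c_w": {"FLIGHT": math.log(0.2), "GROUND": math.log(0.8)},
     "log_p_x_w": -8.0, "log_p_w": -6.0, "length": 5},
]
predicted = classify_nbest(nbest, ["FLIGHT", "GROUND"])
```

Unlike the two-stage rule, every hypothesis in the list can contribute the winning class.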

Page 11

An Illustrative Example

• One hypothesis has the best score, but yields the wrong class.
• The best sentence to yield the correct class has a low score.

Page 12

Minimizing the Misclassifications

The misclassification function:

$$d_r(X_r) = -D(C_r^0, W_r^0, X_r) + \log\left[\frac{1}{N}\sum_{n=1}^{N} \exp\big(\eta\, D(C_r^n, W_r^n, X_r)\big)\right]$$

The loss function associated with the misclassification function:

$$\ell_r\big(d_r(X_r)\big) = \frac{1}{1 + e^{-\gamma_r\, d_r(X_r)}}$$

Minimize the misclassifications:

$$\min_{\Lambda}\; L(\Lambda) = \sum_{r=1}^{R} \ell_r\big(d_r(X_r; \Lambda)\big)$$
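The misclassification measure and sigmoid loss above amount to a few lines of code. This is a sketch of the standard MCE construction, with η and γ as tunable constants, not the paper's exact settings:

```python
import math

def misclassification(d_correct, d_competitors, eta=1.0):
    # d_r(X_r) = -D(C_r^0, W_r^0, X_r)
    #            + log[ (1/N) * sum_n exp(eta * D(C_r^n, W_r^n, X_r)) ]
    n = len(d_competitors)
    return -d_correct + math.log(
        sum(math.exp(eta * d) for d in d_competitors) / n)

def mce_loss(d, gamma=1.0):
    # Sigmoid loss l(d) = 1 / (1 + exp(-gamma * d)); a negative d
    # (correct hypothesis scores highest) gives a loss below 0.5.
    return 1.0 / (1.0 + math.exp(-gamma * d))
```

The smooth sigmoid makes the 0/1 classification error differentiable, which is what enables the gradient updates on the next two slides.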

Page 13

Discriminative Training of Language Model Parameters

Find the language model probabilities $\Lambda_W$, where $\Lambda_{w_x w_y} = \log P(w_y \mid w_x)$, to minimize the total classification loss

$$L(\Lambda_W) = \sum_r \ell_r\big(d_r(W_r; \Lambda_W)\big)$$

GPD update, where $\epsilon$ is the weighting factor, $I(W_r^0, w_x w_y)$ is the count of the bigram in the word string of the correct class, and $I(W_r^n, w_x w_y)$ is the count of the bigram in the word string of the $n$-th competitive class:

$$\Lambda^{(t+1)}_{w_x w_y} = \Lambda^{(t)}_{w_x w_y} - \epsilon\, \ell_r(d_r)\big(1-\ell_r(d_r)\big)\left[-I(W_r^0, w_x w_y) + \sum_{n=1}^{N} \frac{\exp\big(\eta\, D(C_r^n, W_r^n; \Lambda_W)\big)}{\sum_{m=1}^{N} \exp\big(\eta\, D(C_r^m, W_r^m; \Lambda_W)\big)}\, I(W_r^n, w_x w_y)\right]$$
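One GPD step on the bigram log-probabilities can be sketched as below. The function name, the toy bigram model, and the default constants are all hypothetical; only the gradient structure (boost bigrams in the correct-class string, penalize softmax-weighted bigrams in the competitors) follows the slide:

```python
import math

def gpd_bigram_update(lam, correct_counts, comp_counts, comp_scores,
                      eps=0.1, loss=0.5, eta=1.0):
    """One GPD step on bigram log-probs Lambda_{wx wy} (hypothetical sketch).

    lam: current {(wx, wy): log P(wy|wx)}.
    correct_counts: bigram counts I(W_r^0, wx wy) in the correct-class string.
    comp_counts: list of per-competitor counts I(W_r^n, wx wy).
    comp_scores: D(C_r^n, W_r^n; Lambda_W) for each competitor.
    loss: current sigmoid loss l(d_r); its slope is l * (1 - l).
    """
    z = sum(math.exp(eta * s) for s in comp_scores)
    weights = [math.exp(eta * s) / z for s in comp_scores]  # soft competitor weights
    slope = loss * (1.0 - loss)
    new_lam = {}
    for bg, v in lam.items():
        grad = -correct_counts.get(bg, 0)          # bigram in correct string: boost
        for w, counts in zip(weights, comp_counts):
            grad += w * counts.get(bg, 0)          # bigram in competitors: penalize
        new_lam[bg] = v - eps * slope * grad
    return new_lam
```

A bigram that appears only in the correct-class string gets its log-probability raised, nudging the recognizer toward transcriptions that classify correctly.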

Page 14

Discriminative Training of Semantic Classifier Parameters

Find the classifier model parameters $\lambda = \{\lambda_k\}$ to minimize the total classification loss

$$L(\lambda) = \sum_r \ell_r\big(d_r(W_r; \lambda)\big)$$

The classifier contribution to the score is $D(C_r^j, W_r^j) \supset \log P(C_r^j \mid W_r^j)$, so for the max-entropy classifier

$$\frac{\partial D(C_r^j, W_r^j)}{\partial \lambda_k} = f_k(C_r^j, W_r^j) - \sum_{C} P(C \mid W_r^j)\, f_k(C, W_r^j)$$

GPD update, where $\epsilon$ is the weighting factor:

$$\lambda^{(t+1)}_k = \lambda^{(t)}_k - \epsilon\, \ell_r(d_r)\big(1-\ell_r(d_r)\big)\left[-\frac{\partial D(C_r^0, W_r^0, X_r)}{\partial \lambda_k} + \sum_{j=1}^{N} \frac{\exp\big(\eta\, D(C_r^j, W_r^j, X_r)\big)}{\sum_{m=1}^{N} \exp\big(\eta\, D(C_r^m, W_r^m, X_r)\big)}\, \frac{\partial D(C_r^j, W_r^j, X_r)}{\partial \lambda_k}\right]$$
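The max-entropy posterior and its gradient (indicator feature minus its expectation under the model) can be sketched as follows. The token-indicator feature set and both function names are hypothetical simplifications, not the paper's feature functions:

```python
import math

def maxent_posteriors(weights, tokens, classes):
    # P(C|W) = exp(sum_k lambda_k f_k(C, W)) / Z(W), using toy binary
    # features f_(c,t)(C, W) = 1 iff C == c and token t appears in W.
    scores = {c: sum(weights.get((c, t), 0.0) for t in tokens)
              for c in classes}
    m = max(scores.values())                      # stabilize the softmax
    z = sum(math.exp(s - m) for s in scores.values())
    return {c: math.exp(scores[c] - m) / z for c in classes}

def grad_log_posterior(c_target, weights, tokens, classes):
    # d log P(c_target|W) / d lambda_(c,t)
    #   = f_(c,t)(c_target, W) - P(c|W) f_(c,t)(c, W)
    # i.e. observed feature value minus its model expectation.
    p = maxent_posteriors(weights, tokens, classes)
    return {(c, t): (1.0 if c == c_target else 0.0) - p[c]
            for t in tokens for c in classes}
```

Plugging this gradient into the same GPD recipe as the LM update gives the iterative classifier training of the integrated design.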

Page 15

Setup for the Experiments

• ATIS II+III data is used:
  – 5798 training wave files
  – 914 test wave files
  – 410 development wave files (used for parameter tuning & stopping criteria)

• Microsoft SAPI 6.1 speech recognizer is used.

• MCE classifiers are built on top of max-entropy classifiers.

Page 16

Experiments: Baseline System Performance

• ASR transcription: one-best matching sentence, W.

• Classifier training: max-entropy classifiers using one-best ASR transcription.

• Classifier testing: max-entropy classifiers using one-best ASR transcription.

| | Test WER (%) | Test CER (%) |
|---|---|---|
| Manual Transcription | 0.00 | 4.81 |
| ASR Output | 4.82 | 4.92 |

Page 17

Experimental Results

One iteration of training consists of:

Speech utterance → SAPI SR → Discriminative LM training → Discriminative classifier training → CER, with max-entropy classifier training providing the initial classifier.

Page 18

From Recognition to Understanding

• This case study illustrates that joint design of the “recognition” and “understanding” components is beneficial.

• The case study is drawn from the speech research area.

• Does speech translation support a similar conclusion?

• Case studies from the image/video research areas? Image recognition/understanding?

Page 19

Summary

• The “matrix” view of signal processing:
  – “signal type” as the columns
  – “task type” as the rows

• Benefit 1: natural extension of the column elements (e.g., text/language) and of the row elements (e.g., understanding)

• Benefit 2: cross-column breeding: e.g., can speech/audio and image/video recognition researchers learn from each other in terms of machine learning & SP techniques (similarities & differences)?

• Benefit 3: cross-row breeding: e.g., given the trend from speech recognition to understanding (and the kind of approach in the case study), what can we say about image/video and other media understanding?