
Information Fusion from Multiple Modalities for Multimedia Mining Applications


Page 1: Information Fusion from Multiple Modalities for Multimedia Mining Applications

© 2002 IBM Corporation

IBM Research: Multimedia Mining AR Project

http://www.research.ibm.com/AVSTG | May 1 2003

Information Fusion from Multiple Modalities for Multimedia Mining Applications

Giridharan Iyengar (joint work with H. Nock)
Audio-Visual Speech Technologies
Human Language Technologies
IBM TJ Watson Research Center

Page 2: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Acknowledgements

• John R. Smith and the IBM Video TREC team
• Members of the Human Language Technologies department

Page 3: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Outline

• Project Overview
• TREC 2002
• Semantic Modeling using Multiple Modalities
  • Concept Detection
  • Special detector: AV Synchrony
• Retrieval
• Summary

Page 4: Information Fusion from Multiple Modalities for Multimedia Mining Applications


What do we want?

A framework that facilitates learning and detection of concepts in digital media:

• Given a concept and annotated example(s), learn a representation for the concept in digital media
  • Learn models directly using statistical methods (e.g., a face detector)
  • Leverage existing concepts and learn a mapping of the new concept to existing concepts (e.g., people spending leisure time at the beach)
• Given digital media, detect instances of the concepts

Page 5: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Our Thesis

Combining modalities to robustly infer knowledge

Page 6: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Performance dimensions

[Diagram: three inter-related performance dimensions: Concept Acquisition, Concept Accuracy, and Concept Coverage]

Dimensions are inter-related:

• Concept Accuracy: What is the accuracy? What accuracy is desired for acquiring new concepts?
• Concept Acquisition: How many training examples are needed for a desired level of accuracy? For a general concept? For a specific concept?
• Concept Coverage: How many concepts?

Page 7: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Multimedia Semantic Learning Framework

[Architecture diagram: a Video Repository feeds Visual Segmentation, Audio Features (MFCC, ...), Visual Features (color, ...), and Annotation* (MPEG-7); the features train Non-speech Audio Models, Speech Models, and Visual Models, whose outputs are combined by Fusion and used for Retrieval. *Annotation tool available from Alphaworks]

[Timeline: Signals, Features, Semantics, Subjective; annotated with "recent past", "today", "near future", and "our goal"]

Page 8: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Multimodal Video Annotation tool (in Alphaworks)

• MPEG-1 in, MPEG-7 out
• Embedded automatic shot-change detection
• Lexicon editing
• Handles multiple video formats

Page 9: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Model Retrieval tools

Model browser:
• Allows browsing of all models under a given modality
• Primarily for users of the models

Model analyzer:
• Primarily for model builders
• Permits comparisons between different models for a given concept
• Presents model statistics such as PR curves and Average Precision

Page 10: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Concept modeling approaches

[Diagram: a concept C detected by combining component models M1, M2, M3, built with approaches such as SVM, BPM, HMM, and graphical models]

Page 11: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Outline

• Project Overview
• TREC 2002
• Semantic Modeling using Multiple Modalities
  • Concept Detection
  • Special detector: AV Synchrony
• Retrieval
• Summary

Page 12: Information Fusion from Multiple Modalities for Multimedia Mining Applications


What is TREC?

• The NIST-coordinated Text REtrieval Conference (TREC)
• Series of annual information retrieval benchmarks
• Spawned in 1992 from the DARPA TIPSTER information retrieval project
• TREC has become an important forum for evaluating and advancing the state of the art in information retrieval
• Tracks for spoken document retrieval, cross-language retrieval, Web document retrieval, and video retrieval
• Document collections are huge and standardized
• Participating groups represent the who's who of IR research
  • 10-12 commercial companies (e.g., Excalibur, Lexis-Nexis, Xerox, IBM)
  • 20-30 university/research groups across all tracks
  • 70% participation from the US

Page 13: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Video TREC 02

• 2nd Video TREC
• 70 hours of MPEG-1, partitioned into development, search, and feature-test sets
• Tasks: shot-boundary detection; concept detection (10 concepts); search with 25 queries (named entities, events, scenes, objects)
• Benchmarking; donations

Page 14: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Outline

• Project Overview
• TREC 2002
• Semantic Modeling using Multiple Modalities
  • Concept Detection
  • Special detector: AV Synchrony
• Retrieval
• Summary

Page 15: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Performance of IBM Concept detectors at TREC02

[Bar chart: Average Precision per concept for the average submission, the best submission, and IBM at TREC 2002; y-axis 0 to 0.8]

AP = the sum of the precision values at each relevant retrieved item, divided by the total ground truth in the corpus; equivalently, the normalized area under the ideal PR curve.
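That definition translates directly into code; here is a minimal sketch (the function and variable names are illustrative, not from the IBM evaluation code):

```python
def average_precision(retrieved, relevant):
    """retrieved: ranked list of shot ids; relevant: set of ground-truth ids.
    AP = sum of precision values at each relevant retrieved rank,
    divided by the total ground truth in the corpus."""
    hits, precision_sum = 0, 0.0
    for rank, shot in enumerate(retrieved, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0
```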

Page 16: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Is (Spoken) Text useful to detect Visual Concepts?

• Use speech transcripts to detect visual concepts in shots
• Turns concept detection into a speech-based retrieval problem: index the speech transcripts, then use the training set to identify useful query terms
• Generic approach to improve concept coverage

[Bar chart: per-concept Average Precision for visual (VT02) vs. text-based detection; y-axis 0 to 0.6. Comparable performance on TREC 2002 data]

Page 17: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Discriminative Model Fusion

• Novel approach to build models for concepts using a basis of existing models
• Incorporates information about model confidences
• Can be viewed as a feature projection

[Diagram: the outputs of existing detectors M1-M6 on a shot are stacked into a "model vector"; annotations plus model vectors in model-vector space train the new concept model (a sketch follows)]
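To make the projection concrete, here is a minimal sketch of DMF using scikit-learn's SVC as a stand-in for the SMO-trained SVM described on the next slide; the random arrays are placeholders for real detector scores and annotations:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: each row stacks the confidence scores of the
# pre-existing detectors M1..M6 for one shot into a "model vector".
model_vectors = np.random.rand(500, 6)       # 500 shots x 6 base detectors
annotations = np.random.randint(0, 2, 500)   # labels for the new concept

# The new concept model is a classifier trained in model-vector space.
dmf_model = SVC(kernel="rbf")
dmf_model.fit(model_vectors, annotations)
new_concept_scores = dmf_model.decision_function(model_vectors)
```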

Page 18: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Discriminative Model Fusion: Algorithm

$$f(x) = \sum_{i=1}^{N_s} \alpha_i\, y_i\, K(x, x_i) + b$$

• Support Vector Machine: the largest-margin hyperplane in the projected feature space
• With good kernel choices, all operations can be done in the low-dimensional input feature space
• We use Radial Basis Functions as our kernels:

$$K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)$$

• Trained with the Sequential Minimal Optimization algorithm (a sketch of the resulting decision function follows)
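The two formulas above can be written out directly; this is a minimal sketch of the scoring side only, not the SMO training procedure:

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    # K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2))
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, sigma=1.0):
    # f(x) = sum_{i=1}^{N_s} alpha_i * y_i * K(x, x_i) + b
    return sum(a * y * rbf_kernel(x, sv, sigma)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b
```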

Page 19: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Discriminative Model Fusion: Advantages

• Can be used to improve existing models (accuracy) and to build new models (coverage)
• Can be used to fuse text-based models with content-based models (multimodality)

[Diagrams: the same model-vector projection over M1-M6 can retrain an existing model (M1) or build a new model (M9)]

Page 20: Information Fusion from Multiple Modalities for Multimedia Mining Applications


DMF results

Experiment:
• Build a model vector from 6 text-based detectors and 42 pre-existing concept detectors
• Build 6 target concept detectors in this model-vector space using DMF

Accuracy results:
• Concepts either improve (by 23-91%) or stay the same
• MAP improves by 12% over all 6 visual concepts

[Bar chart: per-concept Average Precision for VT02 (visual), Text, and DMF systems; y-axis 0 to 0.6]

Page 21: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Outline

• Project Overview
• TREC 2002
• Semantic Modeling using Multiple Modalities
  • Concept Detection
  • Special detector: AV Synchrony
• Retrieval
• Summary

Page 22: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Audio-visual Synchrony detection

Problem: Is it a narration (voiceover) or a monologue?
• Are the audio and video synchronous? Plausible (i.e., caused by the same person speaking)?

Applications include:
• ASR metadata (speaker turns)
• Talking-head detection (video summarization)
• Dubbing

Page 23: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Existing Work

• Hershey and Movellan (NIPS 1999): single Gaussian model; mutual information between Gaussian models
• Cutler and Davis (ICME 2000): time-delay neural network
• Fisher III et al. (NIPS 2001): learn a projection to a common subspace; the projection maximizes mutual information
• Slaney and Covell (NIPS 2001): learn canonical correlation

Page 24: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Approach 1: Evaluate Synchrony Using Mutual Information

• Detect faces and speech
• Convert speech audio and face video into feature vectors
  • e.g., A1,…,AT = MFC coefficient vectors for audio
  • e.g., V1,…,VT = DCT coefficients for video
• Consider each (A1,V1),…,(AT,VT) as an independent sample from a joint distribution p(A,V)
• Evaluate mutual information I(A;V) as the Consistency Score, assuming distributional forms for p(A), p(V), and p(A,V)

Page 25: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Implementation 1: Discrete Distributions ("VQ-MI")

1. Build codebooks to quantize audio and video feature vectors (using the training set)
2. Convert the test speech and test face sequence into feature vectors A1,…,AT and V1,…,VT and quantize them using the codebooks
3. Use the quantized streams to estimate discrete distributions p(A), p(V), and p(A,V) and calculate the Mutual Information (a sketch follows)
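Step 3 amounts to estimating mutual information from a joint histogram of codebook indices. A minimal sketch, assuming the codebooks have already been built (e.g., via k-means on training features, which is an assumption here):

```python
import numpy as np

def vq_mutual_information(a_codes, v_codes, n_a, n_v):
    """a_codes, v_codes: aligned integer codebook-index sequences.
    n_a, n_v: audio and video codebook sizes."""
    joint = np.zeros((n_a, n_v))
    for a, v in zip(a_codes, v_codes):
        joint[a, v] += 1.0
    joint /= joint.sum()                       # empirical p(A, V)
    p_a = joint.sum(axis=1, keepdims=True)     # marginal p(A)
    p_v = joint.sum(axis=0, keepdims=True)     # marginal p(V)
    nz = joint > 0                             # avoid log(0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (p_a * p_v)[nz])))
```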

Page 26: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Implementation 2: Gaussian Distributions ("G-MI")

1. Convert the test speech and test face sequence into feature vectors A1,…,AT and V1,…,VT
2. Use them to estimate multivariate Gaussian distributions p(A), p(V), and p(A,V) (some similarities with Hershey and Movellan, NIPS 1999)
3. Calculate the Consistency Score I(A;V) (a sketch of the closed form follows)

NOTE: long test sequences may not be Gaussian, so divide them into locally Gaussian segments using the Bayesian Information Criterion (Chen and Gopalakrishnan, 1998)
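For jointly Gaussian variables, I(A;V) has a closed form in terms of covariance determinants. A minimal sketch (the BIC segmentation step is omitted):

```python
import numpy as np

def gaussian_mutual_information(A, V):
    """A: (d_a, T) audio features; V: (d_v, T) video features, frame-aligned.
    For jointly Gaussian A, V:
        I(A;V) = 0.5 * (log|Cov_A| + log|Cov_V| - log|Cov_AV|)"""
    d_a = A.shape[0]
    cov = np.cov(np.vstack([A, V]))            # joint covariance estimate
    log_det = lambda m: np.linalg.slogdet(m)[1]
    return 0.5 * (log_det(cov[:d_a, :d_a])     # log |Cov_A|
                  + log_det(cov[d_a:, d_a:])   # log |Cov_V|
                  - log_det(cov))              # log |Cov_AV|
```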

Page 27: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Approach 2: Evaluate Plausibility (“AV-LL”)

1. Convert the test speech and test face sequence into feature vectors A1,…,AT and V1,…,VT
2. Hypothesize the uttered word sequence W using audio-only automatic speech recognition (or the ground-truth script if available)
3. Calculate the Consistency Score p(A,V|W), here using likelihoods from Hidden Markov Models as used in audio-visual speech recognition

NOTE: preliminary experiments also considered an approximation to p(W|A,V), but results were less successful

Page 28: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Experimental Setup

Corpus and test set construction:
• Constructed from IBM ViaVoice AV data: full-face, front-facing speakers; continuous, clean speech
• 1016 four-second-long "true" (i.e., corresponding) speech and face combinations extracted
• For each "true" case, three "confuser" examples pair the same speech with faces saying something else
• Separate training data used for training models (for the "VQ-MI" and "AV-LL" schemes)

Pre-processing:
• Audio: MFC coefficients
• Video: DCT coefficients of the mouth region of interest

Test purpose:
• Assume perfect face and speech detection
• Evaluate the usefulness of the different consistency definitions

Page 29: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Synchrony results

[Bar chart: detection accuracy (%) of VQ-MI, AV-LL, and G-MI against 1, 2, and 3 confusers]

• Gaussian MI is clearly superior to the VQ and AV-LL schemes
• VQ and AV-LL require training, so there is possible mismatch between training and test data
• For VQ, estimation of discrete densities suffers from a resolution/accuracy trade-off

Page 30: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Application to Speaker Localization

Data from Clemson University's CUAVE corpus.

Investigating two tasks:
• "Active Speaker": Is the left or right person speaking?
• "Active Mouth": Where is the active speaker's mouth?

Assume only one person is speaking at any given time (for now).

Page 31: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Speaker Localization

Task 1: Active Speaker
• Compute Gaussian-based MI between each video pixel and the audio signal over a time window (a sketch follows this list):
  • Scheme 1: use pixel intensities and audio LogFFT
  • Scheme 2: use "delta-like" features based on changes in pixel intensity across time (Butz) and audio LogFFT
• Compare total MI (left half of the screen) vs. total MI (right half), then shift the window and repeat

Task 2: Active Speaker's Mouth
• Search for a compact region with good MI scores; no smoothing of the region between images
• Estimate the mouth center every second; an estimate is considered correct if it lies within a threshold number of pixels of the true center
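Here is a minimal sketch of the per-pixel MI map, simplified to the scalar-Gaussian case where I = -0.5 * log(1 - rho^2) for the correlation rho between one pixel's intensity track and one audio feature; reducing the audio to a single scalar feature is an assumption made for illustration:

```python
import numpy as np

def pixelwise_mi_image(video, audio):
    """video: (T, H, W) pixel intensities over a time window;
    audio: (T,) one scalar audio feature (e.g., a single LogFFT band).
    Scalar-Gaussian MI per pixel: I = -0.5 * log(1 - rho^2)."""
    T, H, W = video.shape
    pix = video.reshape(T, H * W).astype(float)
    pix -= pix.mean(axis=0)
    aud = audio - audio.mean()
    denom = np.linalg.norm(pix, axis=0) * np.linalg.norm(aud) + 1e-12
    rho = pix.T @ aud / denom                  # per-pixel correlation
    mi = -0.5 * np.log(np.clip(1.0 - rho ** 2, 1e-12, 1.0))
    return mi.reshape(H, W)

# Task 1 decision: compare total MI on the left vs. right half of the frame.
# mi = pixelwise_mi_image(window_video, window_audio)
# half = mi.shape[1] // 2
# speaker = "left" if mi[:, :half].sum() > mi[:, half:].sum() else "right"
```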

Page 32: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Speaker Localization: Mutual Information Images

[MI images computed with two kinds of video features: pixel intensities (left) and intensity deltas (right)]

Page 33: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Speaker localization results

Algorithm                       Task 1: Active Speaker   Task 2: Speaker's Mouth
Pixel projection                81.3%                    48.8%
Total pixel intensity change    77.4%                    49.2%
Mutual Information              76.2%                    64.7%

Note: no significant gain for speaker localization from adding prior face detection; speaker mouth localization improves by 4% for AV-MI and 2% for video-only.

Page 34: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Using synchrony to detect monologues in TREC02

Monologue detector:
• Should have a face
• Should contain speech
• Speech and face should be synchronized
  • Threshold the mutual information image at various levels
  • Score: the ratio of the average mutual information of the highest-scoring NxN pixel region to the average mutual information of the whole image (a sketch follows)
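A minimal sketch of that ratio score; the window size n = 16 and the exhaustive search are illustrative choices, not necessarily those of the IBM system:

```python
import numpy as np

def synchrony_score(mi_image, n=16):
    """Ratio of the mean MI in the best-scoring n x n pixel region
    to the mean MI of the whole image."""
    H, W = mi_image.shape
    best = -np.inf
    for r in range(H - n + 1):                 # exhaustive window search
        for c in range(W - n + 1):
            best = max(best, mi_image[r:r + n, c:c + n].mean())
    return best / mi_image.mean()
```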

Page 35: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Monologue Results

[Bar charts: Average Precision (AP) of monologue detection for F+Sync, F+Speech, and F+Sp+Sy at MI-image threshold levels 0.75, 0.85, 0.95, and 0.99]

• The IBM monologue detector was the best in TREC 2002
• Using synchrony does better than Face+Speech alone

Page 36: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Outline

• Project Overview
• TREC 2002
• Semantic Modeling using Multiple Modalities
  • Concept Detection
  • Special detector: AV Synchrony
• Retrieval
• Summary

Page 37: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Video Retrieval using Speech

[Pipeline diagram: speech transcripts (ASR) are divided into documents, e.g., 100-word overlapping windows (a sketch of this step follows the table below); documents and the query term string both have frequent words removed and are POS-tagged and morphed (e.g., "RUNS" -> RUN); a morph index is created; retrieval ranks the documents, and documents are mapped to shots]

ASR systems:
• TREC 2001 recognizer: 58.5% WER
• IBM Pass 1 (HUB-4 models + supervised adaptation + speech segmentation): 40.2% WER
• IBM Pass 2 (IBM Pass 1 + unsupervised adaptation): 34.6% WER
• LIMSI TREC 02 donation: 39.0% WER

Challenges: ASR, text document definition, document ranking, mapping documents to shots.

SYSTEM      %Correct (WER)   MAP
TREC-2001   52.8% (58.5%)    0.09
IBM0        59.7% (63.9%)    0.13
IBM1        67.9% (40.2%)    0.17
IBM2        72.8% (34.6%)    0.21
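As referenced in the pipeline above, a minimal sketch of the document-windowing step; the slide specifies 100-word overlapping windows, and the 50-word step (50% overlap) is an assumption:

```python
def transcript_windows(words, size=100, step=50):
    """Split an ASR transcript (a list of words) into overlapping
    fixed-size pseudo-documents for indexing."""
    starts = list(range(0, max(len(words) - size, 0) + 1, step))
    if starts[-1] + size < len(words):         # cover any trailing words
        starts.append(len(words) - size)
    return [words[i:i + size] for i in starts]
```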

Page 38: Information Fusion from Multiple Modalities for Multimedia Mining Applications


SDR Details: Fusion of Multiple SDR Systems

• Take multiple SDR systems: OKAPI 1, OKAPI 2, and Soft Boolean, all using the same ASR
• Examine complementarity using a common query set: no system returns a superset of another system's results
• Form an additive weighted combination: the "fusion" system (a sketch follows the table)

System        MAP
OKAPI 1       0.09
OKAPI 2       0.08
Soft Boolean  0.11
Fusion        0.15

The fusion system achieved the 2nd-best overall performance at TREC02.
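A minimal sketch of the additive weighted combination; score normalization and the actual weight values (which would have been tuned on held-out queries) are assumptions left out here:

```python
def fuse_sdr_scores(system_scores, weights):
    """system_scores: list of {doc_id: score} dicts, one per SDR system
    (e.g., OKAPI 1, OKAPI 2, Soft Boolean); weights: matching weights.
    Returns documents ranked by the fused score."""
    fused = {}
    for scores, w in zip(system_scores, weights):
        for doc, s in scores.items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```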

Page 39: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Integrating Speech and Video Retrieval: ongoing research

TREC 02 SDR + CBR examples:
• Sometimes multimodal integration improves over the top unimodal system ("rockets taking off": Fused, SDR, CBR)
• Sometimes multimodal integration degrades the top unimodal system:
  • Video degrades fusion ("parrots": Fused, SDR, CBR)
  • Audio degrades fusion ("nuclear clouds": Fused, SDR, CBR)

Use non-speech audio and video cues to improve the "speech-score-to-shot-score" mapping:
• Manual subsetting of videos results in a 44% improvement in MAP; speech-only ranking of videos results in a 5% improvement
• Can multimodal cues be used to come closer to manual performance?
• Is using multimodal cues to rank videos simpler than ranking shots?

Page 40: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Summary

• Information fusion across modalities helps a variety of tasks
• However, speech/text + image-based retrieval remains an open issue

Page 41: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Open Research Challenges

• Multimodal Information Fusion is not a solved problem!
  • Combining text with image content for retrieval
• Model performance
  • Progressive improvements in model performance (accuracy)
  • Under limited training data, with increasing complexity (acquisition)
  • Maintain accuracy as the number of concepts increases (coverage)
• SDR improvements
  • What level of ASR performance is the minimum needed for optimal retrieval performance?
  • Limits of text (speech)-based visual models?
  • Automatic query processing (acquisition)

Page 42: Information Fusion from Multiple Modalities for Multimedia Mining Applications


Discussion

Papers, presentations, and other work: http://www.research.ibm.com/AVSTG/

Page 43: Information Fusion from Multiple Modalities for Multimedia Mining Applications


SDR Details (1): ASR Performance

SUMMARY: 41% relative improvement in Word Error Rate (WER) over the TREC 2001 speech transcription approach.

Video TREC 2001: used the ViaVoice for Broadcast News system
• 93k prototypes
• single trigram model

Improve ASR using the Watson HUB-4 Broadcast News system:
• 285k prototypes, speaker-independent system
• mixture of 3 LMs (4-gram, 3-gram, etc.)

Additionally, incorporate:
• supervised adaptation (8 videos from the training set)
• improved speech vs. non-speech segmentation
• unsupervised test-set adaptation

Resulting systems:
• TREC 2001 recognizer: 58.5% WER
• IBM Pass 1 (HUB-4 models + supervised adaptation + speech segmentation): 40.2% WER
• IBM Pass 2 (IBM Pass 1 + unsupervised adaptation): 34.6% WER
• LIMSI TREC 02 donation: 39.0% WER

Page 44: Information Fusion from Multiple Modalities for Multimedia Mining Applications


SDR Details (1) ctd: Does improving ASR affect Video Retrieval?

• Manually compile (limited) ground truth on the FeatureTrain+Validate subset of TREC 2002
• Set up retrieval systems using ASR with different word error rates (WERs)
• Use the TREC-2002 set of 25 queries to evaluate Mean Average Precision (MAP)
• Open question: what is the upper limit for MAP given "perfect" ASR?

SYSTEM      %Correct (WER)   MAP
TREC-2001   52.8% (58.5%)    0.09
IBM0        59.7% (63.9%)    0.13
IBM1        67.9% (40.2%)    0.17
IBM2        72.8% (34.6%)    0.21

Improvements in ASR => improved video retrieval.