© 2002 IBM Corporation
IBM Research: Multimedia Mining AR Project
http://www.research.ibm.com/AVSTG | May 1 2003
Information Fusion from Multiple Modalities for Multimedia Mining Applications
Giridharan Iyengar (joint work with H. Nock)
Audio-Visual Speech Technologies, Human Language Technologies
IBM TJ Watson Research Center
Acknowledgements
John R. Smith and the IBM Video TREC team
Members of the Human Language Technologies department
Outline
Project Overview
TREC 2002
Semantic Modeling using Multiple Modalities
- Concept Detection
- Special detector – AV Synchrony
Retrieval
Summary
What do we want?
A framework that facilitates learning and detection of concepts in digital media:
- Given a concept and annotated example(s), learn a representation for the concept in digital media
  - Learn models directly using statistical methods (e.g. a face detector)
  - Leverage existing concepts and learn a mapping of the new concept to existing concepts (e.g. people spending leisure time at the beach)
- Given digital media, detect instances of concepts
Our Thesis
Combining modalities to robustly infer knowledge
Performance dimensions
[Axes diagram: Concept Acquisition, Concept Accuracy, and Concept Coverage as three performance dimensions]
Dimensions are inter-related:
- Concept Accuracy: What is the accuracy? What accuracy is desired for acquiring new concepts?
- Concept Acquisition: How many training examples are needed for a desired level of accuracy? For a general concept? For a specific concept?
- Concept Coverage: How many concepts?
Multimedia Semantic Learning Framework
[Architecture diagram: a Video Repository feeds Visual Segmentation and an Annotation* tool (MPEG-7); audio features (MFCC, ...) and visual features (color, ...) are used for training Non-speech Audio Models, Speech Models, and Visual Models; model outputs are combined by Fusion to support Retrieval]
*Available from Alphaworks
[Axis: Signals -> Features -> Semantics -> Subjective, spanning the recent past through today into the near future; the semantics level is our goal]
Multimodal Video Annotation tool (in Alphaworks)
- MPEG-1 in, MPEG-7 out
- Embedded automatic shot-change detection
- Lexicon editing
- Handles multiple video formats
Model Retrieval tools
Model browser:
- Allows browsing of all models under a given modality
- Primarily for a user of the models
Model analyzer:
- Primarily for model builders
- Permits comparisons between different models for a given concept
- Presents model statistics such as PR curves and Average Precision
Concept modeling approaches
[Diagram: a concept C built from models M1, M2, M3 using approaches such as SVM, BPM, HMM, and graphical models]
Outline
Project Overview
TREC 2002
Semantic Modeling using Multiple Modalities
- Concept Detection
- Special detector – AV Synchrony
Retrieval
Summary
What is TREC?
The NIST-coordinated Text REtrieval Conference (TREC):
- A series of annual information retrieval benchmarks
- Spawned in 1992 from the DARPA TIPSTER information retrieval project
- TREC has become an important forum for evaluating and advancing the state of the art in information retrieval
- Tracks for spoken document retrieval, cross-language retrieval, Web document retrieval, and video retrieval
- Document collections are huge and standardized
- Participating groups represent a who's who of IR research
  - 10-12 commercial companies (e.g., Excalibur, Lexis-Nexis, Xerox, IBM)
  - 20-30 university/research groups across all tracks
  - 70% participation from the US
Video TREC 02
2nd Video TREC
- 70 hours of MPEG-1, partitioned into development, search, and feature-test sets
- Shot-boundary detection
- Concept detection (10 concepts)
- Benchmarking
- Donations
- Search: 25 queries (named entities, events, scenes, objects)
Outline
Project Overview
TREC 2002
Semantic Modeling using Multiple Modalities
- Concept Detection
- Special detector – AV Synchrony
Retrieval
Summary
Performance of IBM Concept detectors at TREC02
[Bar chart: Average Precision (0 to 0.8) per concept, comparing the TREC-wide average, the best system, and IBM]
AP = precision accumulated at each relevant retrieved item, divided by the total ground truth in the corpus; equivalently, the normalized area under the ideal PR curve.
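The AP definition above can be expressed as a short computation; this is an illustrative sketch (function and argument names are not from the IBM tooling):

```python
def average_precision(ranked_relevance, total_relevant):
    """AP: precision accumulated at each relevant retrieved item, divided by
    the total ground truth in the corpus (the normalized area under the
    ideal PR curve)."""
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0
```

For example, a ranked list with relevant items at ranks 1 and 3, out of 2 relevant in the corpus, gives AP = (1/1 + 2/3) / 2.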
Is (Spoken) Text useful to detect Visual Concepts?
Use speech transcripts to detect visual concepts in shots
- Turns concept detection into a speech-based retrieval problem
- Index speech transcripts
- Use the training set to identify useful query terms
- A generic approach to improve concept coverage
[Bar chart: Average Precision (0 to 0.6) per concept for VT02 (visual) vs. text-based detection]
Comparable performance on TREC02 data
Discriminative Model Fusion
Novel approach to build models for concepts using a basis of existing models
Incorporates information about model confidences
Can be viewed as a feature projection
[Diagram: confidences from existing models M1-M6 are stacked into a "model vector"; annotated examples in this model vector space train a New Concept Model]
Discriminative Model Fusion: Algorithm
Support Vector Machine:
f(x) = Σ_{i=1..N_s} α_i y_i K(x, x_i) + b
- Largest-margin hyperplane in the projected feature space
- With good kernel choices, all operations can be done in the low-dimensional input feature space
- We use Radial Basis Functions as our kernels:
K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²))
- Trained with the Sequential Minimal Optimization algorithm
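A minimal sketch of the decision function and RBF kernel above in plain Python; the names and values are illustrative, and a real DMF system would learn the α_i and b with an SMO trainer rather than take them as inputs:

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Radial Basis Function kernel: K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2*sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, alphas, labels, bias, sigma=1.0):
    """SVM decision function f(x) = sum_i alpha_i * y_i * K(x, x_i) + b over
    the support set; here x would be a 'model vector' of detector confidences."""
    return sum(a * y * rbf_kernel(x, sv, sigma)
               for a, y, sv in zip(alphas, labels, support_vectors)) + bias
```

With a single support vector equal to the query point, K = 1 and f(x) reduces to α·y + b, which is a quick sanity check on the implementation.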
Discriminative Model Fusion: Advantages
Can be used to improve existing models (accuracy) and build new models (coverage)
Can be used to fuse text-based models with content-based models (multimodality)
[Diagrams: the same model-vector projection over M1-M6 used (a) to rebuild an existing model (M1) for better accuracy and (b) to build a new model (M9) for coverage]
DMF results
Experiment:
- Build a model vector from 6 text-based detectors and 42 pre-existing concept detectors
- Build 6 target concept detectors in this model vector space using DMF
Accuracy results:
- Concepts either improve (by 23-91%) or stay the same
- MAP improves by 12% over all 6 visual concepts
[Bar chart: Average Precision (0 to 0.6) per concept for the VT02, Text, and DMF runs]
Outline
Project Overview
TREC 2002
Semantic Modeling using Multiple Modalities
- Concept Detection
- Special detector – AV Synchrony
Retrieval
Summary
Audio-visual Synchrony detection
Problem: Is it a narration (voiceover) or a monologue?
- Are the audio and video synchronous? Plausible? (i.e., caused by the same person speaking)
Applications include:
- ASR metadata (speaker turns)
- Talking-head detection (video summarization)
- Dubbing
Existing Work
- Hershey and Movellan (NIPS 1999): single Gaussian model; mutual information between Gaussian models
- Cutler and Davis (ICME 2000): time-delay neural network
- Fisher III et al. (NIPS 2001): learn a projection to a common subspace that maximizes mutual information
- Slaney and Covell (NIPS 2001): learn canonical correlation
Approach 1: Evaluate Synchrony Using Mutual Information
Detect faces and speech, then convert speech audio and face video into feature vectors:
- e.g. A1,…,AT = MFC coefficient vectors for audio
- e.g. V1,…,VT = DCT coefficients for video
Consider each (A1,V1),…,(AT,VT) as an independent sample from a joint distribution p(A,V)
Evaluate mutual information I(A;V) as the consistency score; assume distributional forms for p(A), p(V), and p(A,V)
Implementation 1: Discrete Distributions ("VQ-MI")
1. Build codebooks to quantize audio and video feature vectors (using the training set)
2. Convert the test speech and test face sequence into feature vectors A1,…,AT and V1,…,VT and quantize them using the audio and video codebooks
3. Use the quantized vectors to estimate discrete distributions p(A), p(V), and p(A,V) and calculate the mutual information
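Step 3 of the VQ-MI scheme, estimating discrete distributions from paired codeword sequences and computing their mutual information, can be sketched as follows (quantization to codebook indices is assumed already done; names are illustrative):

```python
import math
from collections import Counter

def discrete_mutual_information(a_codes, v_codes):
    """Mutual information I(A;V) in bits, with p(A), p(V), p(A,V) estimated
    as relative frequencies of the paired codebook indices."""
    n = len(a_codes)
    count_a = Counter(a_codes)
    count_v = Counter(v_codes)
    count_av = Counter(zip(a_codes, v_codes))
    mi = 0.0
    for (a, v), c in count_av.items():
        # p(a,v) * log2( p(a,v) / (p(a) * p(v)) )
        mi += (c / n) * math.log2(c * n / (count_a[a] * count_v[v]))
    return mi
```

Two identical binary sequences give 1 bit of mutual information; independent sequences give 0, matching the intuition that a synchronous pair scores higher than a confuser.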
Implementation 2: Gaussian Distributions ("G-MI")
1. Convert the test speech and test face sequence into feature vectors A1,…,AT and V1,…,VT
2. Use these to estimate multivariate Gaussian distributions p(A), p(V), and p(A,V) (some similarities with Hershey and Movellan, NIPS 1999)
3. Calculate the consistency score I(A;V)
NOTE: long test sequences may not be Gaussian, so divide them into locally Gaussian segments using the Bayesian Information Criterion (Chen and Gopalakrishnan, 1998)
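As a simplified illustration of the G-MI consistency score, here is the jointly Gaussian mutual information for scalar features; the actual system uses multivariate Gaussians over MFC/DCT vectors, so treat this 1-D version as a sketch:

```python
import math

def gaussian_mi_1d(a, v):
    """Consistency score I(A;V) under a jointly Gaussian assumption for
    scalar features: I = -0.5 * ln(1 - rho^2), where rho is the sample
    correlation. This is the 1-D case of the multivariate formula
    I(A;V) = 0.5 * ln(|Sigma_A| * |Sigma_V| / |Sigma_AV|)."""
    n = len(a)
    ma, mv = sum(a) / n, sum(v) / n
    cov = sum((x - ma) * (y - mv) for x, y in zip(a, v)) / n
    var_a = sum((x - ma) ** 2 for x in a) / n
    var_v = sum((y - mv) ** 2 for y in v) / n
    rho2 = cov * cov / (var_a * var_v)
    return -0.5 * math.log(1.0 - rho2)
```

Uncorrelated sequences score 0; any correlation between the audio and video features pushes the score above 0.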
Approach 2: Evaluate Plausibility ("AV-LL")
1. Convert the test speech and test face sequence into feature vectors A1,…,AT and V1,…,VT
2. Hypothesize the uttered word sequence W using audio-only automatic speech recognition (or the ground-truth script if available)
3. Calculate the consistency score p(A,V|W), here using likelihoods from Hidden Markov Models as used in audio-visual speech recognition
NOTE: preliminary experiments also considered an approximation to p(W|A,V), but results were less successful
Experimental Setup
Corpus and test-set construction:
- Constructed from IBM ViaVoice AV data: full-face, front-facing speakers; continuous, clean speech
- 1016 four-second-long "true" (i.e. corresponding) speech-and-face combinations extracted
- For each "true" case, three "confuser" examples pair the same speech with faces saying something else
- Separate training data used for training models (for schemes "VQ-MI", "AV-LL")
Pre-processing:
- Audio: MFC coefficients
- Video: DCT coefficients of the mouth region of interest
Test purpose:
- Assume perfect face and speech detection
- Evaluate the usefulness of the different consistency definitions
Synchrony results
[Bar chart: detection accuracy (0-100%) vs. number of confusers (1, 2, 3) for the VQ-MI, AV-LL, and G-MI schemes]
- The Gaussian scheme (G-MI) is clearly superior to the VQ and AV schemes
- VQ and AV-LL require training, so there is a possible mismatch between training and test data
- For VQ, estimation of discrete densities suffers from a resolution/accuracy trade-off
Application to Speaker Localization
Data from Clemson University's CUAVE corpus
Investigating two tasks:
- "Active Speaker": Is the left or right person speaking?
- "Active Mouth": Where is the active speaker's mouth?
Assume only one person is speaking at any given time (for now)
Speaker Localization
Task 1: Active Speaker
- Compute Gaussian-based MI between each video pixel and the audio signal over a time window
  - Scheme 1: use pixel intensities and audio log-FFT
  - Scheme 2: use "delta-like" features based on changes in pixel intensity across time (Butz) and audio log-FFT
- Compare total MI over the left half of the screen vs. the right half
- Shift the window and repeat
Task 2: Active Speaker's Mouth
- Search for a compact region with good MI scores; no smoothing of the region between images
- Estimate the mouth center every second; the estimate is considered correct if it falls within the search tolerance (in pixels) of the true center
Speaker Localization: Mutual Information Images
[Two MI images compared: video features = pixel intensities vs. video features = intensity deltas]
Speaker localization results
Algorithm | Task 1: Active Speaker | Task 2: Speaker's Mouth
Pixel projection | 81.3% | 48.8%
Total pixel intensity change | 77.4% | 49.2%
Mutual Information | 76.2% | 64.7%

Note: no significant gain for speaker localization from adding prior face detection; speaker-mouth localization improves by 4% for AV-MI and 2% for video-only.
Using synchrony to detect monologues in TREC02
Monologue detector:
- Should have a face
- Should contain speech
- Speech and face should be synchronized
Synchrony score: threshold the mutual-information image at various levels; take the ratio of the average mutual information of the highest-scoring NxN pixel region to the average mutual information of the whole image
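The ratio score above can be sketched as follows, assuming a precomputed per-pixel mutual-information image (the function name and the brute-force window search are illustrative):

```python
def synchrony_score(mi_image, n):
    """Ratio of the mean MI in the best n-by-n window to the mean MI of the
    whole image; a compact high-MI region (a synchronous mouth) gives a
    high score."""
    h, w = len(mi_image), len(mi_image[0])
    overall = sum(sum(row) for row in mi_image) / (h * w)
    best = max(
        sum(mi_image[r + i][c + j] for i in range(n) for j in range(n)) / (n * n)
        for r in range(h - n + 1)
        for c in range(w - n + 1)
    )
    return best / overall
```

On a toy 2x2 image with all MI concentrated in one pixel, the 1x1 score is 4x the image mean, while the full-image window scores exactly 1.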
Monologue Results
[Bar charts: monologue detection at thresholds 0.75, 0.85, 0.95, and 0.99, and Average Precision (0 to 0.3), for the F+Sync, F+Speech, and F+Sp+Sy detectors]
The IBM monologue detector was the best in TREC 2002; using synchrony does better than Face+Speech alone.
Outline
Project Overview
TREC 2002
Semantic Modeling using Multiple Modalities
- Concept Detection
- Special detector – AV Synchrony
Retrieval
Summary
Video Retrieval using Speech

[Pipeline diagram:
Speech transcripts (ASR) -> divide into documents (e.g. 100-word, overlapping windows) -> remove frequent words, POS-tag + morph (e.g. "RUNS" -> RUN) -> create morph index.
Query term string -> remove frequent words, POS-tag + morph -> retrieve: rank documents -> map documents to shots.]

ASR inputs:
- TREC 2001 recognizer: 58.5% WER
- IBM Pass 1 (HUB-4 models + supervised adaptation + speech segmentation): 40.2% WER
- IBM Pass 2 (IBM Pass 1 + unsupervised adaptation): 34.6% WER
- LIMSI TREC 02 donation: 39.0% WER

Challenges: ASR, text document definition, document ranking, mapping documents to shots

SYSTEM | %Correct (WER) | MAP
TREC-2001 | 52.8% (58.5%) | 0.09
IBM0 | 59.7% (63.9%) | 0.13
IBM1 | 67.9% (40.2%) | 0.17
IBM2 | 72.8% (34.6%) | 0.21
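The "divide into documents" step, a fixed-size overlapping window over the transcript, can be sketched as below; the slide specifies 100-word overlapping windows but not the step size, so the 50% overlap here is an assumption:

```python
def make_documents(transcript_words, window=100, step=50):
    """Slide a fixed-size, overlapping window over an ASR transcript;
    each window becomes one retrievable 'document'. The step size (here a
    50% overlap) is an assumed parameter, not specified in the slides."""
    docs = []
    for start in range(0, max(1, len(transcript_words) - window + 1), step):
        docs.append(transcript_words[start:start + window])
    return docs
```

A transcript shorter than the window yields a single short document, so nothing is dropped at video boundaries.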
SDR Details: Fusion of Multiple SDR Systems
Take multiple SDR systems (OKAPI 1, OKAPI 2, soft Boolean) using the same ASR
Examine complementarity using a common query set: no system returns a superset of another system's results
Form an additive weighted combination: the "fusion" system

System | MAP
OKAPI 1 | 0.09
OKAPI 2 | 0.08
Soft Boolean | 0.11
Fusion | 0.15

The fusion system achieved the second-best overall performance at TREC02.
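The additive weighted combination can be sketched as below; the weights and score scales are illustrative, since the slide does not say how the weights were chosen:

```python
def fuse_scores(system_scores, weights):
    """Additive weighted combination of per-document scores from several
    SDR systems (e.g. OKAPI 1, OKAPI 2, soft Boolean over the same ASR);
    returns document ids ranked by fused score."""
    fused = {}
    for scores, w in zip(system_scores, weights):
        for doc, s in scores.items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    return sorted(fused, key=fused.get, reverse=True)
```

Because the systems are complementary, a document missed by one system can still be ranked highly if another system scores it well.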
Integrating Speech and Video Retrieval: ongoing research
TREC 02 SDR + CBR examples:
- Sometimes multimodal integration improves over the top unimodal system ("rockets taking off": Fused, SDR, CBR)
- Sometimes multimodal integration degrades the top unimodal system
  - Video degrades fusion ("parrots": Fused, SDR, CBR)
  - Audio degrades fusion ("nuclear clouds": Fused, SDR, CBR)
Use non-speech audio and video cues to improve the speech-score-to-shot-score mapping:
- Manual subsetting of videos results in a 44% improvement in MAP; speech-only ranking of videos results in a 5% improvement
- Can multimodal cues be used to come closer to manual performance?
- Is using multimodal cues to rank videos simpler than ranking shots?
Summary
Information fusion across modalities helps a variety of tasks
Speech/text + image-based retrieval, though, remains an open issue
Open Research Challenges
Multimodal Information Fusion is not a solved problem!
- Combining text with image content for retrieval
Model performance:
- Progressive improvements in model performance (accuracy)
- Under limited training data, with increasing complexity (acquisition)
- Maintain accuracy as the number of concepts increases (coverage)
SDR improvements:
- What level of ASR performance is the minimum for optimal retrieval performance?
- Limits on text (speech)-based visual models?
- Automatic query processing (acquisition)
Discussion
Papers, presentations, and other work: http://www.research.ibm.com/AVSTG/
SDR Details (1): ASR Performance
SUMMARY: 41% relative improvement in Word Error Rate (WER) over the TREC 2001 speech-transcription approach
VIDEO TREC 2001: used the ViaVoice for Broadcast News system (93k prototypes, single trigram model)
Improve ASR using the Watson HUB-4 Broadcast News system:
- 285k prototypes, speaker-independent system
- Mixture of 3 LMs (4-gram, 3-gram, etc.)
Additionally, incorporate:
- Supervised adaptation (8 videos from the training set)
- Improved speech vs. non-speech segmentation
- Unsupervised test-set adaptation

Resulting systems:
- TREC 2001 recognizer: 58.5% WER
- IBM Pass 1 (HUB-4 models + supervised adaptation + speech segmentation): 40.2% WER
- IBM Pass 2 (IBM Pass 1 + unsupervised adaptation): 34.6% WER
- LIMSI TREC 02 donation: 39.0% WER
SDR Details (1) ctd: Does improving ASR affect video retrieval?
- Manually compile (limited) ground truth on the FeatureTrain+Validate subset of TREC 2002
- Set up retrieval systems using ASR of different word error rates (WERs)
- Use the TREC-2002 set of 25 queries to evaluate Mean Average Precision (MAP)
- Open question: what is the upper limit for MAP given "perfect" ASR?

SYSTEM | %Correct (WER) | MAP
TREC-2001 | 52.8% (58.5%) | 0.09
IBM0 | 59.7% (63.9%) | 0.13
IBM1 | 67.9% (40.2%) | 0.17
IBM2 | 72.8% (34.6%) | 0.21

Improvements in ASR => improved video retrieval