© 2002 IBM Corporation
IBM Research: Multimedia Mining AR Project
http://www.research.ibm.com/AVSTG | May 1 2003
Information Fusion from Multiple Modalities for Multimedia Mining Applications
Giridharan Iyengar (joint work with H. Nock)
Audio-Visual Speech Technologies, Human Language Technologies
IBM TJ Watson Research Center
Acknowledgements
John R. Smith and the IBM Video TREC team
Members of the Human Language Technologies department
Outline
Project Overview
TREC 2002
Semantic Modeling using Multiple Modalities
- Concept Detection
- Special detector – AV Synchrony
Retrieval
Summary
What do we want?
A framework that facilitates learning and detection of concepts in digital media:
- Given a concept and annotated example(s), learn a representation for the concept in digital media
  - Learn models directly using statistical methods (e.g. a face detector)
  - Leverage existing concepts and learn a mapping of the new concept to existing concepts (e.g. people spending leisure time at the beach)
- Given digital media, detect instances of concepts
Our Thesis
Combining modalities to robustly infer knowledge
Performance dimensions
[Axes diagram: Concept Acquisition, Concept Accuracy, and Concept Coverage as three performance dimensions]
Dimensions are inter-related:
- Concept Accuracy: What is the accuracy? What accuracy is desired for acquiring new concepts?
- Concept Acquisition: How many training examples are needed for a desired level of accuracy? For a general concept? For a specific concept?
- Concept Coverage: How many concepts?
Multimedia Semantic Learning Framework
[Architecture diagram: a Video Repository feeds Visual Segmentation and an Annotation* tool (MPEG-7); audio features (MFCC, ...) and visual features (color, ...) are used for training Non-speech Audio Models, Speech Models, and Visual Models; model outputs are combined by Fusion to support Retrieval]
*Available from Alphaworks
[Axis: Signals -> Features -> Semantics -> Subjective, spanning the recent past through today into the near future; the semantics level is our goal]
Multimodal Video Annotation tool (in Alphaworks)
- MPEG-1 in, MPEG-7 out
- Embedded automatic shot-change detection
- Lexicon editing
- Handles multiple video formats
Model Retrieval tools
Model browser:
- Allows browsing of all models under a given modality
- Primarily for a user of the models
Model analyzer:
- Primarily for model builders
- Permits comparisons between different models for a given concept
- Presents model statistics such as PR curves and Average Precision
Concept modeling approaches
[Diagram: a concept C built from models M1, M2, M3 using approaches such as SVM, BPM, HMM, and graphical models]
Outline
Project Overview
TREC 2002
Semantic Modeling using Multiple Modalities
- Concept Detection
- Special detector – AV Synchrony
Retrieval
Summary
What is TREC?
The NIST-coordinated Text REtrieval Conference (TREC):
- A series of annual information retrieval benchmarks
- Spawned in 1992 from the DARPA TIPSTER information retrieval project
- TREC has become an important forum for evaluating and advancing the state of the art in information retrieval
- Tracks for spoken document retrieval, cross-language retrieval, Web document retrieval, and video retrieval
- Document collections are huge and standardized
- Participating groups represent a who's who of IR research
  - 10-12 commercial companies (e.g., Excalibur, Lexis-Nexis, Xerox, IBM)
  - 20-30 university/research groups across all tracks
  - 70% participation from the US
Video TREC 02
2nd Video TREC
- 70 hours of MPEG-1, partitioned into development, search, and feature-test sets
- Shot-boundary detection
- Concept detection (10 concepts)
- Benchmarking
- Donations
- Search: 25 queries (named entities, events, scenes, objects)
Outline
Project Overview
TREC 2002
Semantic Modeling using Multiple Modalities
- Concept Detection
- Special detector – AV Synchrony
Retrieval
Summary
Performance of IBM Concept detectors at TREC02
[Bar chart: Average Precision (0 to 0.8) per concept, comparing the TREC-wide average, the best system, and IBM]
AP = precision accumulated at each relevant retrieved item, divided by the total ground truth in the corpus; equivalently, the normalized area under the ideal PR curve.
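The AP definition above can be expressed as a short computation; this is an illustrative sketch (function and argument names are not from the IBM tooling):

```python
def average_precision(ranked_relevance, total_relevant):
    """AP: precision accumulated at each relevant retrieved item, divided by
    the total ground truth in the corpus (the normalized area under the
    ideal PR curve)."""
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0
```

For example, a ranked list with relevant items at ranks 1 and 3, out of 2 relevant in the corpus, gives AP = (1/1 + 2/3) / 2.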
Is (Spoken) Text useful to detect Visual Concepts?
Use speech transcripts to detect visual concepts in shots
- Turns concept detection into a speech-based retrieval problem
- Index speech transcripts
- Use the training set to identify useful query terms
- A generic approach to improve concept coverage
[Bar chart: Average Precision (0 to 0.6) per concept for VT02 (visual) vs. text-based detection]
Comparable performance on TREC02 data
Discriminative Model Fusion
Novel approach to build models for concepts using a basis of existing models
Incorporates information about model confidences
Can be viewed as a feature projection
[Diagram: confidences from existing models M1-M6 are stacked into a "model vector"; annotated examples in this model vector space train a New Concept Model]
Discriminative Model Fusion: Algorithm
Support Vector Machine:
f(x) = Σ_{i=1..N_s} α_i y_i K(x, x_i) + b
- Largest-margin hyperplane in the projected feature space
- With good kernel choices, all operations can be done in the low-dimensional input feature space
- We use Radial Basis Functions as our kernels:
K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²))
- Trained with the Sequential Minimal Optimization algorithm
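A minimal sketch of the decision function and RBF kernel above in plain Python; the names and values are illustrative, and a real DMF system would learn the α_i and b with an SMO trainer rather than take them as inputs:

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Radial Basis Function kernel: K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2*sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, alphas, labels, bias, sigma=1.0):
    """SVM decision function f(x) = sum_i alpha_i * y_i * K(x, x_i) + b over
    the support set; here x would be a 'model vector' of detector confidences."""
    return sum(a * y * rbf_kernel(x, sv, sigma)
               for a, y, sv in zip(alphas, labels, support_vectors)) + bias
```

With a single support vector equal to the query point, K = 1 and f(x) reduces to α·y + b, which is a quick sanity check on the implementation.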
Discriminative Model Fusion: Advantages
Can be used to improve existing models (accuracy) and build new models (coverage)
Can be used to fuse text-based models with content-based models (multimodality)
[Diagrams: the same model-vector projection over M1-M6 used (a) to rebuild an existing model (M1) for better accuracy and (b) to build a new model (M9) for coverage]
DMF results
Experiment:
- Build a model vector from 6 text-based detectors and 42 pre-existing concept detectors
- Build 6 target concept detectors in this model vector space using DMF
Accuracy results:
- Concepts either improve (by 23-91%) or stay the same
- MAP improves by 12% over all 6 visual concepts
[Bar chart: Average Precision (0 to 0.6) per concept for the VT02, Text, and DMF runs]
Outline
Project Overview
TREC 2002
Semantic Modeling using Multiple Modalities
- Concept Detection
- Special detector – AV Synchrony
Retrieval
Summary
Audio-visual Synchrony detection
Problem: Is it a narration (voiceover) or a monologue?
- Are the audio and video synchronous? Plausible? (i.e., caused by the same person speaking)
Applications include:
- ASR metadata (speaker turns)
- Talking-head detection (video summarization)
- Dubbing
Existing Work
- Hershey and Movellan (NIPS 1999): single Gaussian model; mutual information between Gaussian models
- Cutler and Davis (ICME 2000): time-delay neural network
- Fisher III et al. (NIPS 2001): learn a projection to a common subspace that maximizes mutual information
- Slaney and Covell (NIPS 2001): learn canonical correlation
Approach 1: Evaluate Synchrony Using Mutual Information
Detect faces and speech, then convert speech audio and face video into feature vectors:
- e.g. A1,…,AT = MFC coefficient vectors for audio
- e.g. V1,…,VT = DCT coefficients for video
Consider each (A1,V1),…,(AT,VT) as an independent sample from a joint distribution p(A,V)
Evaluate mutual information I(A;V) as the consistency score; assume distributional forms for p(A), p(V), and p(A,V)
Implementation 1: Discrete Distributions ("VQ-MI")
1. Build codebooks to quantize audio and video feature vectors (using the training set)
2. Convert the test speech and test face sequence into feature vectors A1,…,AT and V1,…,VT and quantize them using the audio and video codebooks
3. Use the quantized vectors to estimate discrete distributions p(A), p(V), and p(A,V) and calculate the mutual information
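Step 3 of the VQ-MI scheme, estimating discrete distributions from paired codeword sequences and computing their mutual information, can be sketched as follows (quantization to codebook indices is assumed already done; names are illustrative):

```python
import math
from collections import Counter

def discrete_mutual_information(a_codes, v_codes):
    """Mutual information I(A;V) in bits, with p(A), p(V), p(A,V) estimated
    as relative frequencies of the paired codebook indices."""
    n = len(a_codes)
    count_a = Counter(a_codes)
    count_v = Counter(v_codes)
    count_av = Counter(zip(a_codes, v_codes))
    mi = 0.0
    for (a, v), c in count_av.items():
        # p(a,v) * log2( p(a,v) / (p(a) * p(v)) )
        mi += (c / n) * math.log2(c * n / (count_a[a] * count_v[v]))
    return mi
```

Two identical binary sequences give 1 bit of mutual information; independent sequences give 0, matching the intuition that a synchronous pair scores higher than a confuser.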
Implementation 2: Gaussian Distributions ("G-MI")
1. Convert the test speech and test face sequence into feature vectors A1,…,AT and V1,…,VT
2. Use these to estimate multivariate Gaussian distributions p(A), p(V), and p(A,V) (some similarities with Hershey and Movellan, NIPS 1999)
3. Calculate the consistency score I(A;V)
NOTE: long test sequences may not be Gaussian, so divide them into locally Gaussian segments using the Bayesian Information Criterion (Chen and Gopalakrishnan, 1998)
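As a simplified illustration of the G-MI consistency score, here is the jointly Gaussian mutual information for scalar features; the actual system uses multivariate Gaussians over MFC/DCT vectors, so treat this 1-D version as a sketch:

```python
import math

def gaussian_mi_1d(a, v):
    """Consistency score I(A;V) under a jointly Gaussian assumption for
    scalar features: I = -0.5 * ln(1 - rho^2), where rho is the sample
    correlation. This is the 1-D case of the multivariate formula
    I(A;V) = 0.5 * ln(|Sigma_A| * |Sigma_V| / |Sigma_AV|)."""
    n = len(a)
    ma, mv = sum(a) / n, sum(v) / n
    cov = sum((x - ma) * (y - mv) for x, y in zip(a, v)) / n
    var_a = sum((x - ma) ** 2 for x in a) / n
    var_v = sum((y - mv) ** 2 for y in v) / n
    rho2 = cov * cov / (var_a * var_v)
    return -0.5 * math.log(1.0 - rho2)
```

Uncorrelated sequences score 0; any correlation between the audio and video features pushes the score above 0.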
Approach 2: Evaluate Plausibility ("AV-LL")
1. Convert the test speech and test face sequence into feature vectors A1,…,AT and V1,…,VT
2. Hypothesize the uttered word sequence W using audio-only automatic speech recognition (or the ground-truth script if available)
3. Calculate the consistency score p(A,V|W), here using likelihoods from Hidden Markov Models as used in audio-visual speech recognition
NOTE: preliminary experiments also considered an approximation to p(W|A,V), but results were less successful
Experimental Setup
Corpus and test-set construction:
- Constructed from IBM ViaVoice AV data: full-face, front-facing speakers; continuous, clean speech
- 1016 four-second-long "true" (i.e. corresponding) speech-and-face combinations extracted
- For each "true" case, three "confuser" examples pair the same speech with faces saying something else
- Separate training data used for training models (for schemes "VQ-MI", "AV-LL")
Pre-processing:
- Audio: MFC coefficients
- Video: DCT coefficients of the mouth region of interest
Test purpose:
- Assume perfect face and speech detection
- Evaluate the usefulness of the different consistency definitions
Synchrony results
[Bar chart: detection accuracy (0-100%) vs. number of confusers (1, 2, 3) for the VQ-MI, AV-LL, and G-MI schemes]
- The Gaussian scheme (G-MI) is clearly superior to the VQ and AV schemes
- VQ and AV-LL require training, so there is a possible mismatch between training and test data
- For VQ, estimation of discrete densities suffers from a resolution/accuracy trade-off
Application to Speaker Localization
Data from Clemson University's CUAVE corpus
Investigating two tasks:
- "Active Speaker": Is the left or right person speaking?
- "Active Mouth": Where is the active speaker's mouth?
Assume only one person is speaking at any given time (for now)
Speaker Localization
Task 1: Active Speaker
- Compute Gaussian-based MI between each video pixel and the audio signal over a time window
  - Scheme 1: use pixel intensities and audio log-FFT
  - Scheme 2: use "delta-like" features based on changes in pixel intensity across time (Butz) and audio log-FFT
- Compare total MI over the left half of the screen vs. the right half
- Shift the window and repeat
Task 2: Active Speaker's Mouth
- Search for a compact region with good MI scores; no smoothing of the region between images
- Estimate the mouth center every second; the estimate is considered correct if it falls within the search tolerance (in pixels) of the true center
Speaker Localization: Mutual Information Images
[Two MI images compared: video features = pixel intensities vs. video features = intensity deltas]
Speaker localization results
Algorithm | Task 1: Active Speaker | Task 2: Speaker's Mouth
Pixel projection | 81.3% | 48.8%
Total pixel intensity change | 77.4% | 49.2%
Mutual Information | 76.2% | 64.7%

Note: no significant gain for speaker localization from adding prior face detection; speaker-mouth localization improves by 4% for AV-MI and 2% for video-only.
Using synchrony to detect monologues in TREC02
Monologue detector:
- Should have a face
- Should contain speech
- Speech and face should be synchronized
Synchrony score: threshold the mutual-information image at various levels; take the ratio of the average mutual information of the highest-scoring NxN pixel region to the average mutual information of the whole image
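The ratio score above can be sketched as follows, assuming a precomputed per-pixel mutual-information image (the function name and the brute-force window search are illustrative):

```python
def synchrony_score(mi_image, n):
    """Ratio of the mean MI in the best n-by-n window to the mean MI of the
    whole image; a compact high-MI region (a synchronous mouth) gives a
    high score."""
    h, w = len(mi_image), len(mi_image[0])
    overall = sum(sum(row) for row in mi_image) / (h * w)
    best = max(
        sum(mi_image[r + i][c + j] for i in range(n) for j in range(n)) / (n * n)
        for r in range(h - n + 1)
        for c in range(w - n + 1)
    )
    return best / overall
```

On a toy 2x2 image with all MI concentrated in one pixel, the 1x1 score is 4x the image mean, while the full-image window scores exactly 1.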
Monologue Results
[Bar charts: monologue detection at thresholds 0.75, 0.85, 0.95, and 0.99, and Average Precision (0 to 0.3), for the F+Sync, F+Speech, and F+Sp+Sy detectors]
The IBM monologue detector was the best in TREC 2002; using synchrony does better than Face+Speech alone.
Outline
Project Overview
TREC 2002
Semantic Modeling using Multiple Modalities
- Concept Detection
- Special detector – AV Synchrony
Retrieval
Summary
Video Retrieval using Speech

[Pipeline diagram:
Speech transcripts (ASR) -> divide into documents (e.g. 100-word, overlapping windows) -> remove frequent words, POS-tag + morph (e.g. "RUNS" -> RUN) -> create morph index.
Query term string -> remove frequent words, POS-tag + morph -> retrieve: rank documents -> map documents to shots.]

ASR inputs:
- TREC 2001 recognizer: 58.5% WER
- IBM Pass 1 (HUB-4 models + supervised adaptation + speech segmentation): 40.2% WER
- IBM Pass 2 (IBM Pass 1 + unsupervised adaptation): 34.6% WER
- LIMSI TREC 02 donation: 39.0% WER

Challenges: ASR, text document definition, document ranking, mapping documents to shots

SYSTEM | %Correct (WER) | MAP
TREC-2001 | 52.8% (58.5%) | 0.09
IBM0 | 59.7% (63.9%) | 0.13
IBM1 | 67.9% (40.2%) | 0.17
IBM2 | 72.8% (34.6%) | 0.21
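The "divide into documents" step, a fixed-size overlapping window over the transcript, can be sketched as below; the slide specifies 100-word overlapping windows but not the step size, so the 50% overlap here is an assumption:

```python
def make_documents(transcript_words, window=100, step=50):
    """Slide a fixed-size, overlapping window over an ASR transcript;
    each window becomes one retrievable 'document'. The step size (here a
    50% overlap) is an assumed parameter, not specified in the slides."""
    docs = []
    for start in range(0, max(1, len(transcript_words) - window + 1), step):
        docs.append(transcript_words[start:start + window])
    return docs
```

A transcript shorter than the window yields a single short document, so nothing is dropped at video boundaries.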
SDR Details: Fusion of Multiple SDR Systems
Take multiple SDR systems (OKAPI 1, OKAPI 2, soft Boolean) using the same ASR
Examine complementarity using a common query set: no system returns a superset of another system's results
Form an additive weighted combination: the "fusion" system

System | MAP
OKAPI 1 | 0.09
OKAPI 2 | 0.08
Soft Boolean | 0.11
Fusion | 0.15

The fusion system achieved the second-best overall performance at TREC02.
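The additive weighted combination can be sketched as below; the weights and score scales are illustrative, since the slide does not say how the weights were chosen:

```python
def fuse_scores(system_scores, weights):
    """Additive weighted combination of per-document scores from several
    SDR systems (e.g. OKAPI 1, OKAPI 2, soft Boolean over the same ASR);
    returns document ids ranked by fused score."""
    fused = {}
    for scores, w in zip(system_scores, weights):
        for doc, s in scores.items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    return sorted(fused, key=fused.get, reverse=True)
```

Because the systems are complementary, a document missed by one system can still be ranked highly if another system scores it well.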
Integrating Speech and Video Retrieval: ongoing research
TREC 02 SDR + CBR examples:
- Sometimes multimodal integration improves over the top unimodal system ("rockets taking off": Fused, SDR, CBR)
- Sometimes multimodal integration degrades the top unimodal system
  - Video degrades fusion ("parrots": Fused, SDR, CBR)
  - Audio degrades fusion ("nuclear clouds": Fused, SDR, CBR)
Use non-speech audio and video cues to improve the speech-score-to-shot-score mapping:
- Manual subsetting of videos results in a 44% improvement in MAP; speech-only ranking of videos results in a 5% improvement
- Can multimodal cues be used to come closer to manual performance?
- Is using multimodal cues to rank videos simpler than ranking shots?
Summary
Information fusion across modalities helps a variety of tasks
Speech/text + image-based retrieval, though, remains an open issue
Open Research Challenges
Multimodal Information Fusion is not a solved problem!
- Combining text with image content for retrieval
Model performance:
- Progressive improvements in model performance (accuracy)
- Under limited training data, with increasing complexity (acquisition)
- Maintain accuracy as the number of concepts increases (coverage)
SDR improvements:
- What level of ASR performance is the minimum for optimal retrieval performance?
- Limits on text (speech)-based visual models?
- Automatic query processing (acquisition)
Discussion
Papers, presentations, and other work: http://www.research.ibm.com/AVSTG/
SDR Details (1): ASR Performance
SUMMARY: 41% relative improvement in Word Error Rate (WER) over the TREC 2001 speech-transcription approach
VIDEO TREC 2001: used the ViaVoice for Broadcast News system (93k prototypes, single trigram model)
Improve ASR using the Watson HUB-4 Broadcast News system:
- 285k prototypes, speaker-independent system
- Mixture of 3 LMs (4-gram, 3-gram, etc.)
Additionally, incorporate:
- Supervised adaptation (8 videos from the training set)
- Improved speech vs. non-speech segmentation
- Unsupervised test-set adaptation

Resulting systems:
- TREC 2001 recognizer: 58.5% WER
- IBM Pass 1 (HUB-4 models + supervised adaptation + speech segmentation): 40.2% WER
- IBM Pass 2 (IBM Pass 1 + unsupervised adaptation): 34.6% WER
- LIMSI TREC 02 donation: 39.0% WER
SDR Details (1) ctd: Does improving ASR affect video retrieval?
- Manually compile (limited) ground truth on the FeatureTrain+Validate subset of TREC 2002
- Set up retrieval systems using ASR of different word error rates (WERs)
- Use the TREC-2002 set of 25 queries to evaluate Mean Average Precision (MAP)
- Open question: what is the upper limit for MAP given "perfect" ASR?

SYSTEM | %Correct (WER) | MAP
TREC-2001 | 52.8% (58.5%) | 0.09
IBM0 | 59.7% (63.9%) | 0.13
IBM1 | 67.9% (40.2%) | 0.17
IBM2 | 72.8% (34.6%) | 0.21

Improvements in ASR => improved video retrieval