ICME 2004
Tzvetanka I. Ianeva
Arjen P. de Vries
Thijs Westerveld
A Dynamic Probabilistic Multimedia Retrieval Model
Introduction
• Video representation schemes used for retrieval:
– Static
– Spatio-temporal
• Video is a temporal medium, so a ‘good’ model should overcome the limitations of keyframe-based shot representation
Spatio-temporal grouping
• Spatial priority and tracking of regions from frame to frame
• Joint spatial and temporal segmentation
– Human vision finds salient structures jointly in space and time (Gepshtein and Kubovy, 2000)
Motivation
• Pursue video retrieval instead of image (keyframe) retrieval
• Extension of the Static Probabilistic Multimedia Retrieval model (2003)
• GMM in DCT-space-time domain
– Diagonal covariance
Static Model
[Figure: documents (Docs) and their models (Models)]
• Indexing
– Estimate Gaussian Mixture Models from images using EM
– Based on feature vectors with colour, texture and position information from pixel blocks
– Fixed number of components
Static Model
• Indexing
– Estimate a Gaussian Mixture Model from each keyframe (using EM)
– Fixed number of components (C=8)
– Feature vectors contain colour, texture, and position information from pixel blocks: <x,y,DCT>
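The feature vectors <x,y,DCT> described above could be built roughly as follows. This is only a sketch under stated assumptions: the 8x8 block size, the single grey channel, the use of block indices as position, and keeping the first few DCT coefficients in row-major order are all illustrative choices that the slide does not specify. The names `dct2` and `block_features` are hypothetical.

```python
import math

BLOCK = 8  # assumed block size; the slide does not state one


def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def alpha(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for r in range(n):
                for c in range(n):
                    total += (block[r][c]
                              * math.cos((2 * r + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * c + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out


def block_features(image, n_coeffs=4):
    """Return one vector [x, y, dct_0, ..., dct_{n_coeffs-1}] per block.

    `image` is a list of rows of grey values. x and y are the block's
    column and row index (an assumption; any consistent position works).
    """
    height, width = len(image), len(image[0])
    features = []
    for top in range(0, height - BLOCK + 1, BLOCK):
        for left in range(0, width - BLOCK + 1, BLOCK):
            block = [row[left:left + BLOCK] for row in image[top:top + BLOCK]]
            coeffs = dct2(block)
            # Row-major flattening is an arbitrary choice for this sketch.
            flat = [coeffs[u][v] for u in range(BLOCK) for v in range(BLOCK)]
            features.append([left // BLOCK, top // BLOCK] + flat[:n_coeffs])
    return features
```

For a 16x16 image this yields four vectors, one per block; a GMM would then be estimated from these vectors.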
Static Model
• Retrieval
– Calculate conditional probabilities of query samples given the models in the collection
[Figure: the query Q is scored against each model in the collection: P(Q|M1), P(Q|M2), P(Q|M3), P(Q|M4)]
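As an illustration of the retrieval step, the sketch below scores a set of query samples against several diagonal-covariance Gaussian mixtures and ranks the models. It assumes the query samples are independent, so their log-probabilities add up; the model layout (a list of `(weight, mean, var)` tuples) and all function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def log_mixture(x, model):
    """log P(x | model); model is a list of (weight, mean, var) components."""
    terms = [math.log(w) + log_gaussian_diag(x, mean, var)
             for w, mean, var in model]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def rank_models(query_samples, models):
    """Sort model names by the total log-likelihood of the query samples."""
    scores = {name: sum(log_mixture(x, model) for x in query_samples)
              for name, model in models.items()}
    return sorted(scores, key=scores.get, reverse=True)


models = {
    "M1": [(1.0, [0.0, 0.0], [1.0, 1.0])],
    "M2": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [-5.0, -5.0], [1.0, 1.0])],
}
query = [[0.1, -0.2], [0.3, 0.0]]
ranking = rank_models(query, models)
```

Here the query samples lie near the single component of M1, so M1 is ranked first.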
Dynamic Model
• Selecting frames
– 1 second sequence around the keyframe
– Entire video shot as sequence of frames sampled at regular intervals
• Features < x, y, t, DCT >
Dynamic Model
• Indexing:
– GMM of multiple frames around keyframe
– Feature vectors extended with time-stamp normalized in [0,1]: <x,y,t,DCT>
[Figure: time axis of the frame sequence, with ticks at 0, 0.5 and 1]
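One way to extend the per-frame feature vectors with a normalized time-stamp is sketched below. It assumes the frames are given in temporal order and evenly spaced, so the first frame gets t = 0 and the last t = 1; the function name and the data layout are hypothetical.

```python
def add_timestamps(frames):
    """Append a normalized time-stamp t to every block feature vector.

    `frames` is a list of frames in temporal order; each frame is a list
    of [x, y, dct...] vectors. The first frame gets t = 0, the last t = 1.
    """
    last = len(frames) - 1
    samples = []
    for index, frame in enumerate(frames):
        # A single-frame sequence gets the midpoint (an assumption).
        t = index / last if last > 0 else 0.5
        for vector in frame:
            x, y, *dct = vector
            samples.append([x, y, t] + dct)
    return samples


frames = [[[0, 0, 1.0]], [[0, 0, 2.0]], [[0, 0, 3.0]]]
samples = add_timestamps(frames)
```

All samples from all frames end up in one pool, from which a single GMM per shot can be estimated.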
Query example: A single image
• Use an artificial sequence of 29 images as the single query example, with time normalized between 0 and 1
• Alternatively, extend the query example image’s features with a fixed temporal feature value of 0.5
– Better results and lower computational cost
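The two ways of turning a single query image into spatio-temporal samples can be compared with a small sketch. It assumes the artificial sequence consists of identical copies of the query image, which the slide does not state; both function names are hypothetical.

```python
def query_fixed_time(image_features, t=0.5):
    """Give every query vector the same temporal feature value t."""
    return [[x, y, t] + dct for x, y, *dct in image_features]


def query_artificial_sequence(image_features, n_frames=29):
    """Repeat the image n_frames times with t running from 0 to 1.

    Repeating identical copies of the image is an assumption here.
    """
    samples = []
    for index in range(n_frames):
        t = index / (n_frames - 1)
        samples.extend([x, y, t] + dct for x, y, *dct in image_features)
    return samples


features = [[0, 0, 1.0], [1, 0, 2.0]]
fixed = query_fixed_time(features)
sequence = query_artificial_sequence(features)
```

The artificial sequence produces 29 times as many samples to score against every model, which is consistent with the lower computational cost noted for the fixed value.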
Dynamic Model Advantages
• More training data for models
– Less sensitive to random initialization
• Reduced dependency upon selecting appropriate keyframe
• Some spatio-temporal aspects of shot are captured
– (Dis-)appearance of objects
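Retrieval Framework
• Smoothing
\[ RSV(\omega_i) = \frac{1}{N} \sum_{j=1}^{N} \log\left[ \kappa P(x_j \mid \omega_i) + (1-\kappa) P(x_j) \right] \]
• Building dynamic GMMs
– Likelihood goes to infinity ???
\[ P(x \mid \omega_i) = \sum_{c=1}^{N_C} P(C_{i,c}) \, G(x, \mu_{i,c}, \Sigma_{i,c}) \]
\[ G(x, \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right) \]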
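The smoothing step can be sketched as follows, assuming the retrieval status value is the average log of a linear interpolation between each sample's probability under the shot model and its background probability. The mixing weight `kappa` and its default of 0.9 are placeholders, not values taken from the slides, and the function name is hypothetical.

```python
import math


def rsv(p_given_model, p_background, kappa=0.9):
    """Smoothed retrieval status value of one shot model.

    p_given_model[j] is the probability of query sample j under the shot
    model and p_background[j] its background probability.
    kappa=0.9 is a placeholder value.
    """
    n = len(p_given_model)
    total = 0.0
    for p_model, p_bg in zip(p_given_model, p_background):
        total += math.log(kappa * p_model + (1.0 - kappa) * p_bg)
    return total / n


score = rsv([0.2, 0.0], [0.1, 0.1], kappa=0.5)
```

Because of the background term, a sample with zero probability under the shot model still contributes a finite log value instead of minus infinity.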
Experimental Set-up
• Build models for each shot
– Static, Dynamic, Language
• Build queries from topics
– Construct simple keyword text query
– Select visual example
– Rescale and compress example images to match video size and quality
Combining Modalities
• Independence assumption textual/visual
– P(Qt,Qv|Shot) = P(Qt|LM) * P(Qv|GMM)
• Combination works if both runs are useful [CWI:TREC:2002]
• Dynamic run more useful than static run

Run            MAP
ASR only       .130
Static only    .022
Static+ASR     .105
Dynamic only   .022
Dynamic+ASR    .132
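Under the independence assumption, the product P(Qt|LM) * P(Qv|GMM) becomes a sum of log scores. A minimal sketch, assuming each run is available as a dictionary from shot identifier to log score (a hypothetical layout), and that shots missing from either run are skipped:

```python
def combine_scores(text_logp, visual_logp):
    """Rank shots by log P(Qt|LM) + log P(Qv|GMM).

    Shots missing from either run are skipped (an assumption of this sketch).
    """
    shared = text_logp.keys() & visual_logp.keys()
    combined = {shot: text_logp[shot] + visual_logp[shot] for shot in shared}
    # Highest combined score first; ties broken by shot identifier.
    return sorted(combined, key=lambda shot: (-combined[shot], shot))


text_run = {"a": -1.0, "b": -3.0, "c": -2.0}
visual_run = {"a": -5.0, "b": -1.0, "c": -1.5}
ranking = combine_scores(text_run, visual_run)
```

Shot "c" is ranked first here even though it is the top shot in neither run, because it scores reasonably on both modalities.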
Dynamic: Higher initial precision
[Figure: static run compared with dynamic run]
Dow Jones Topic (120)
• “Dow Jones Industrial Average rise day points”
[Figure: the keyword query combined (+) with a visual example, and the result (=)]
Conclusions
• Dynamic model captures visual similarity better
– Spatio-temporal aspects
– More training data
– Appropriate keyframe less critical
– Less sensitive to the random initialization
• ASR + dynamic better than either alone
Future work
• More data needs more computational effort: optimizations?
• Avoid the singular solutions: dynamic number of components?
• Full covariance in space-time < x,y,t >
• Integration of audio
Thanks !!!
Merging Run Results
• Combining (conflicting) examples is difficult [CWI:TREC:2002]
• A single example misses relevant shots
• Round-robin merging
[Figure: two ranked lists (ranks 1 to 10 each) interleaved into a combined list: 1, 1, 2, 2, 3, 3, 4, 4, ...]
MAP by choice of query examples:

Examples   Visual only   +ASR
Single     .022          .132
All        .031          .149
Selected   .039          .151
Best       .050          .155
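Round-robin merging of the ranked lists obtained for several query examples can be sketched as below. Skipping shots that already appear in the merged list is an assumption of this sketch; the slides do not say how duplicates are handled. The function name is hypothetical.

```python
from itertools import zip_longest


def round_robin_merge(*runs):
    """Interleave ranked lists: rank 1 of each run, then rank 2, and so on.

    A shot that already appeared in the merged list is skipped (an
    assumption; the slides do not say how duplicates are handled).
    """
    merged, seen = [], set()
    for tier in zip_longest(*runs):
        for shot in tier:
            # zip_longest pads shorter runs with None; skip the padding.
            if shot is not None and shot not in seen:
                seen.add(shot)
                merged.append(shot)
    return merged


merged = round_robin_merge(["a", "b", "c"], ["d", "b", "e", "f"])
```

Runs of different lengths are handled by `zip_longest`, so the tail of the longer run is appended once the shorter run is exhausted.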
Conclusions
• Visual aspects of an information need are best captured by using multiple examples
• Combining results for multiple (good) examples in round-robin fashion, each ranked on both modalities, gives near-best performance for almost all topics