Video Search Engines and
Content-Based Retrieval
Steven C.H. Hoi
CUHK, CSE
18-Sept, 2006
Outline
Video Search Engines
Content-Based Video Retrieval
Video Search Engines
A survey of the state of the art
Introduction
Who is building video search engines?
Top text search engines: 5.6 billion searches (07/2006)
Introduction Google
Introduction Yahoo
Introduction MSN/Live Search
Introduction YouTube
Business Models
Web Advertising
  Site Volume, or keyword customized
Video Ads
  Disable controls (MSN)
Subscription
  MLB, Real
Download to own
  iTunes
Movie Rental
  Limited time, number of plays
Other
  Desktop Media Search
  Media player (jukebox)
  Media Monitoring
  Media Asset Management
Types of Video Sites
Content Originators
  Major Broadcasters
  Affiliates, Local News
  Major League Baseball
Syndication, Aggregation, “Internet Broadcasters”
  Rental, purchase, advertising, subscription
  MSN, Google, iTunes
  ROO Media, FeedRoom
Movie and Video Download
Share portals
  Consumer content, blogs
  YouTube, Putfile, Vsocial, Google, Akimbo
Traditional Search Engines (Crawl) / “RSS”
  Yahoo, Blinkx
Other
  Public (Internet Archive)
  Media Monitoring, asset management systems
Video Search Challenges
Current Video Search Engines
Metadata
  File type and context
  Media file attributes
    Size, length
  Structured global metadata
    RSS content description
Content
  Content Indexing
    Search within a video
    Full text of dialog
    Image or video content
  Automated Content Indexing
Current Video Search Engines
Content Search Engines
Keyword search with transcripts from speech recognition
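As a rough illustration of keyword search over speech-recognition transcripts, the sketch below builds a tiny inverted index from time-stamped transcript segments. The segment format, the function names and the sample data are all assumptions made for this example; they do not describe any particular engine.

```python
from collections import defaultdict


def build_index(segments):
    """Map each term to the (video_id, start_seconds) segments containing it.

    `segments` is an iterable of (video_id, start_seconds, transcript_text);
    this layout is a hypothetical stand-in for real ASR output.
    """
    index = defaultdict(list)
    for video_id, start, text in segments:
        for term in set(text.lower().split()):
            index[term].append((video_id, start))
    return index


def search(index, query):
    """Return the segments whose transcript contains every query term."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


segments = [
    ("news01", 0.0, "the president visits china"),
    ("news01", 30.0, "weather report for hong kong"),
    ("sports02", 12.5, "baseball season opens"),
]
index = build_index(segments)
print(search(index, "president china"))
```

Because each posting carries a start time, a hit can point into the video rather than only at the whole file, which is the "search within a video" behaviour.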
Content-Based Video Search Engine
Architecture
Content-Based Video Search Engine
Video Processing
Content-Based Video Search Engine
Research Challenges
  Speech Recognition
  Shot Boundary Detection
  Video Story Segmentation
  Concept Detection
  Multi-modal Fusion for Ranking
    Text/ASR, Audio/Speech, Visual, etc.
Content-Based Retrieval
Our Research Problem
  Learning to rank video shots for automatic content-based search tasks!
Challenges
  Multi-Modal Information Fusion
  Small Sample Learning (a few pos. & no neg.)
  Learning on large-scale datasets
Multi-modal and Multi-scale Ranking Framework
Main Ideas
  Representing video structures by graphs
  Using semi-supervised learning to address the small labeled sample learning problem
  Fusing multi-modal information by harmonic learning over graphs
  Multi-scale ranking for achieving efficient performance on large-scale datasets
Multi-modal and Multi-scale Ranking Framework
Graph-based Modeling
[Figure: graph model of video structure with Story, Text and Shot nodes.]
Multi-modal and Multi-scale Ranking Framework
Semi-Supervised Learning on Graph
To find an optimal real-valued function g: V → R on the graph G that minimizes a quadratic energy function:
Using the Gaussian field and harmonic properties from spectral graph theory (Zhu et al., ICML'03), a harmonic function g can be found:
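The slide's own equations did not survive extraction. For orientation, the standard formulation from Zhu et al. (ICML'03), which the slide cites, is written out below. The symbols w_{ij} (edge weight between nodes i and j), d_j (degree of node j) and \sigma_d (per-dimension bandwidth) follow that paper and are assumptions about what the slide showed.

```latex
% Quadratic energy over the graph, with Gaussian edge weights
E(g) = \frac{1}{2} \sum_{i,j} w_{ij} \bigl( g(i) - g(j) \bigr)^{2},
\qquad
w_{ij} = \exp\!\Bigl( -\sum_{d} \frac{(x_{id} - x_{jd})^{2}}{\sigma_{d}^{2}} \Bigr)

% Harmonic property: each unlabeled node takes the weighted average of its neighbors
g(j) = \frac{1}{d_{j}} \sum_{i} w_{ij}\, g(i)
\quad \text{for every unlabeled node } j,
\qquad
d_{j} = \sum_{i} w_{ij}
```

The minimizer of E(g), with g clamped to the given scores on the labeled nodes, is exactly the function that satisfies the harmonic property on the unlabeled nodes.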
Multi-modal and Multi-scale Ranking Framework
Semi-Supervised Learning on Graph
Let
The solution of the harmonic function g can be expressed in matrix operations:
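The matrix expression itself is missing from the extracted slide. In Zhu et al. (ICML'03) the closed form is g_u = (D_uu - W_uu)^{-1} W_ul g_l, where the subscripts u and l select the unlabeled and labeled nodes, W is the weight matrix and D its diagonal degree matrix. The following is a minimal sketch of that closed form in plain Python on a toy graph; the helper names and the small Gaussian-elimination solver are illustrative only, and a real system would use a numerical library.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def harmonic_scores(w, labeled):
    """Return g for every node; `labeled` maps node index -> fixed score."""
    n = len(w)
    unl = [i for i in range(n) if i not in labeled]
    deg = [sum(row) for row in w]
    # Left side (D_uu - W_uu), right side W_ul g_l
    a = [[(deg[i] if i == j else 0.0) - w[i][j] for j in unl] for i in unl]
    b = [sum(w[i][j] * labeled[j] for j in labeled) for i in unl]
    g = dict(labeled)
    g.update(zip(unl, solve(a, b)))
    return [g[i] for i in range(n)]


# Toy path graph 0 - 1 - 2 - 3 with unit weights; node 0 is a positive
# example (score 1) and node 3 is given score 0.
w = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
scores = harmonic_scores(w, {0: 1.0, 3: 0.0})
print(scores)
```

On this path the scores fall off linearly between the two labeled ends, so each unlabeled node equals the average of its neighbors, as the harmonic property requires.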
Multi-modal and Multi-scale Ranking Framework
Multi-Modal Fusion over Graph
To combine text information into SSL on the visual modality, we consider the text inputs as attached nodes on the visual graph:
Visual: g
Text: f
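The fusion formula is not recoverable from the slide. One way to realize "attached nodes" is the external-classifier construction in Zhu et al. (ICML'03): every unlabeled visual node i gets an extra clamped neighbor holding its text score f(i), reached with probability eta, while the ordinary graph edges are discounted by (1 - eta). The sketch below assumes that construction. The parameter `eta`, the function name and the toy scores are hypothetical, and the authors' actual fusion rule may differ.

```python
def fuse_on_graph(w, labeled, text_scores, eta=0.5, iters=200):
    """Iterate g(i) = (1 - eta) * weighted neighbor average + eta * f(i).

    w           : symmetric weight matrix of the visual graph (list of lists)
    labeled     : dict node -> clamped score (query examples)
    text_scores : dict node -> text relevance f(i), assumed to lie in [0, 1]
    """
    n = len(w)
    g = [labeled.get(i, text_scores.get(i, 0.0)) for i in range(n)]
    for _ in range(iters):
        for i in range(n):
            if i in labeled:
                continue  # labeled scores stay clamped
            deg = sum(w[i])
            neighbor_avg = sum(w[i][j] * g[j] for j in range(n)) / deg if deg else 0.0
            g[i] = (1.0 - eta) * neighbor_avg + eta * text_scores.get(i, 0.0)
    return g


# Same toy path graph 0 - 1 - 2 - 3; both unlabeled shots get text score 0.5.
w = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
fused = fuse_on_graph(w, {0: 1.0, 3: 0.0}, {1: 0.5, 2: 0.5}, eta=0.5)
print(fused)
```

With eta = 0 the update reduces to the purely visual harmonic function, and with eta = 1 each unlabeled shot simply keeps its text score, so eta trades off the two modalities.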
Multi-modal and Multi-scale Ranking Framework
Challenges
Number of examples in database: N is large
For example:
  TRECVID 2005: Rep. Key-Frames N = 45,765
  TRECVID 2006: Rep. Key-Frames N = 79,487
How to do Semi-Supervised Learning?!
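To get a feel for why N matters, the short calculation below estimates the memory needed just to hold a dense N x N graph weight matrix for the two key-frame counts above. Dense storage with 8-byte floats is an assumption made for this estimate; it is not a statement about how any actual system stores its graph.

```python
def dense_matrix_gigabytes(n, bytes_per_entry=8):
    """Memory for a dense n x n matrix, in gigabytes (10**9 bytes)."""
    return n * n * bytes_per_entry / 1e9


sizes = {"TRECVID 2005": 45_765, "TRECVID 2006": 79_487}
for name, n in sizes.items():
    print(f"{name}: N = {n:,} -> {dense_matrix_gigabytes(n):.1f} GB")
# TRECVID 2005: N = 45,765 -> 16.8 GB
# TRECVID 2006: N = 79,487 -> 50.5 GB
```

Solving a linear system of that size directly is more expensive still, which motivates running graph-based learning only on a small, pre-filtered candidate set.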
Multi-modal and Multi-scale Ranking Framework
Multi-Scale Ranking
  Learning ranking through multi-scale reranking
  Each stage is associated with different computational costs
  In our solution, four ranking stages include:
    Ranking by Text Retrieval using Language Models
    Re-ranking by NN fusing Text and Visual
    Re-ranking by SVM fusing Text and Visual
    Re-ranking by multi-modal Semi-supervised Learning
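The staged idea can be sketched as a cascade: each stage scores only the survivors of the previous one and keeps a shorter list, so the most expensive ranker sees the fewest shots. The scoring functions, cut-off sizes and toy data below are placeholders chosen for illustration, not the rankers or thresholds used in the actual system.

```python
def cascade_rank(candidates, stages):
    """Apply (score_fn, keep) stages in order, from cheapest to costliest."""
    pool = list(candidates)
    for score_fn, keep in stages:
        pool = sorted(pool, key=score_fn, reverse=True)[:keep]
    return pool


# Toy shots with hypothetical (text_score, visual_score) pairs.
shots = {
    "s1": (0.9, 0.1),
    "s2": (0.8, 0.9),
    "s3": (0.2, 1.0),
    "s4": (0.7, 0.5),
}

stages = [
    (lambda s: shots[s][0], 3),                # cheap: text score only
    (lambda s: shots[s][0] + shots[s][1], 2),  # costlier: text + visual
]
top = cascade_rank(shots, stages)
print(top)  # ['s2', 's4']
```

Note that s3 has the best visual score but is dropped by the text stage; this is the trade-off of a cascade, where early cut-offs must be generous enough not to discard good candidates.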
[Figure: multi-modal fusion and multi-scale ranking pipeline. Raw video clips/streams pass through video, image and text processing to produce video stories and video shots. For a user's query, text ranking selects the top M related stories and top N1 related shots; Text + Visual NN re-ranking keeps the top N2 shots; supervised ranking (SVM/KLR) keeps the top N3 shots; semi-supervised ranking (SSR) keeps the top N4 shots; the top K shots are returned.]
Benchmark Evaluations
Dataset: TRECVID 2005
  Test: 140 video clips, 45,765 rep. key frames
  24 queries
  A query example:
<videoTopic num="0152">
  <textDescription text="Find shots of Hu Jintao, president of the People's Republic of China" />
</videoTopic>
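A topic in this format can be read with Python's standard XML parser. The snippet is a small sketch that handles only the two fields visible in the example above; real topic files may carry further elements that are not shown on the slide, and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

TOPIC_XML = """
<videoTopic num="0152">
  <textDescription text="Find shots of Hu Jintao, president of the People's Republic of China" />
</videoTopic>
"""


def parse_topic(xml_text):
    """Return (topic number, text description) from one videoTopic element."""
    root = ET.fromstring(xml_text.strip())
    description = root.find("textDescription")
    return root.get("num"), description.get("text")


topic_num, topic_text = parse_topic(TOPIC_XML)
print(topic_num, "->", topic_text)
```

The topic number is kept as a string so that the leading zero in "0152" is preserved.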
Benchmark Evaluations
Text-only Retrieval
  No Pseudo-Relevance Feedback (No-PRF)
  With Pseudo-Relevance Feedback (PRF)
[Figure: Evaluation of Language Models. MAP (axis from 0 to 0.1) for TF-IDF, Okapi, KL-JM, KL-DIR and KL-ABS, each with No-PRF and PRF.]
[Figure: Text-only Results. MAP (axis from 0 to 0.1) for IBM, Columbia, TRECVID-Max and CUHK.]
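For readers unfamiliar with the language-model rankers named in the chart, the sketch below scores a transcript by query log-likelihood with Jelinek-Mercer smoothing. It assumes that "KL-JM" denotes KL-divergence retrieval with Jelinek-Mercer smoothing; with a maximum-likelihood query model that ranking is equivalent to query likelihood. The smoothing weight `lam` and the toy transcripts are illustrative only.

```python
import math
from collections import Counter


def jm_log_likelihood(query, doc, collection, lam=0.5):
    """log p(query | doc), mixing document and collection term frequencies."""
    doc_tf, col_tf = Counter(doc), Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc) if doc else 0.0
        p_col = col_tf[term] / len(collection)
        p = (1.0 - lam) * p_doc + lam * p_col
        if p == 0.0:
            return float("-inf")  # term never seen anywhere in the collection
        score += math.log(p)
    return score


transcripts = {
    "shot_a": "president hu jintao visits".split(),
    "shot_b": "baseball game tonight".split(),
}
collection = [term for doc in transcripts.values() for term in doc]
query = ["hu", "jintao"]
ranking = sorted(
    transcripts,
    key=lambda s: jm_log_likelihood(query, transcripts[s], collection),
    reverse=True,
)
print(ranking)
```

The collection term keeps the score finite for shots whose transcript misses a query word, which matters for short and noisy speech transcripts.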
Benchmark Evaluations
Visual Features
  Color: Grid Color Moment, 3*3 grid, 81 dimensions
  Edge: Edge Direction Histogram, 36 bins + 1, 37 dimensions
  Texture: Gabor Moments, 5*8 = 40, 3 moments, 120 dimensions
  238 dimensions in total
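The 81-dimension figure for grid color moments is consistent with 9 grid cells x 3 color channels x 3 moments. The sketch below computes such a feature in plain Python, assuming the three moments are mean, standard deviation and skewness (the usual color-moment definition). The slide does not state the moments or the color space, so both are assumptions, as are the function names.

```python
import math


def color_moments(pixels):
    """Mean, standard deviation and skewness per channel: 9 values."""
    out = []
    n = len(pixels)
    for ch in range(3):
        vals = [p[ch] for p in pixels]
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / n
        third = sum((v - mean) ** 3 for v in vals) / n
        skew = math.copysign(abs(third) ** (1.0 / 3.0), third)  # signed cube root
        out += [mean, math.sqrt(var), skew]
    return out


def grid_color_moments(image, rows=3, cols=3):
    """image: list of rows of (c1, c2, c3) pixels, at least rows x cols in size."""
    h, w = len(image), len(image[0])
    feature = []
    for r in range(rows):
        for c in range(cols):
            cell = [
                image[y][x]
                for y in range(r * h // rows, (r + 1) * h // rows)
                for x in range(c * w // cols, (c + 1) * w // cols)
            ]
            feature += color_moments(cell)
    return feature


image = [[(10, 20, 30)] * 6 for _ in range(6)]  # flat 6 x 6 test image
feature = grid_color_moments(image)
print(len(feature))            # 81
print(len(feature) + 37 + 120)  # 238, matching the total on the slide
```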
[Figure: Normalized Comparison of GCM, EDH, Gabor and GCM+Gabor+EDH features on COREL Benchmark Photos.]
Benchmark Evaluations
Multi-modal Retrieval (Text + Visual)
  Text-only retrieval
  Text + NN (Text + Visual)
  Text + SVM (Text + Visual)
  MMMS (Text + Visual)
Benchmark Evaluations
Evaluation Results: Average Performance on TRECVID 2005 Dataset

Method      MAP      Num_Ret   Improvement
Text        0.0903   1669      0%
Text+NN     0.1034   1705      +14.51%
Text+SVM    0.1083   1764      +19.93%
MMMS        0.1157   1764      +28.13%
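The Improvement column is the relative MAP gain over the text-only baseline. A quick check of the reported figures, using only the MAP values from the table:

```python
def relative_improvement(score, baseline):
    """Relative gain over the baseline, in percent."""
    return 100.0 * (score - baseline) / baseline


baseline_map = 0.0903  # text-only MAP
results = {"Text+NN": 0.1034, "Text+SVM": 0.1083, "MMMS": 0.1157}
for method, value in results.items():
    print(f"{method}: +{relative_improvement(value, baseline_map):.2f}%")
# Text+NN: +14.51%
# Text+SVM: +19.93%
# MMMS: +28.13%
```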
Benchmark Evaluations
[Figure: Comparison with other approaches. MAP (average performance of 24 queries; axis from 0.095 to 0.12) for IBM (T+V+M), CUHK-MMMS, Columbia (V+T+M) and IBM (V+T).]
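MAP here is the mean, over the queries, of each query's average precision. The sketch below uses the textbook definition on toy data; details of the official TRECVID evaluation, such as the depth at which the ranked list is truncated, are not modeled, and the shot identifiers are made up.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k where a relevant shot appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, shot in enumerate(ranked, start=1):
        if shot in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),                 # AP = (1/2) / 1
]
print(round(mean_average_precision(runs), 4))
```

Relevant shots that never appear in the ranked list still count in the denominator, so a system is penalized for missing them.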
Related Work
IBM Solution
  SVM + NN + Multiple Instance Learning
Columbia Solution
  Information-Theoretic Clustering Approach
CMU Solution
  Query-Class Dependent Weighting Ranking
Conclusion
A tutorial of video search engines
Research contributions
  A unified framework of Multi-Modal and Multi-Scale Ranking for video retrieval
  Graph-based Modeling of video structures
  Semi-Supervised Learning for Multimodal Ranking
  Making SSL practical for large-scale problems
  Promising empirical results…
Future Work
Research is in progress, with tough challenges ahead…
Any suggestions or comments?