INTELLIGENT MULTIMODAL CONTENT-BASED VIDEO
RETRIEVAL
Synopsis
Submitted by
PRASANNA S.
In partial fulfillment for the award of the degree
Of
DOCTOR OF PHILOSOPHY
Under the guidance of
Dr. Purushothaman S.
B.E., M.E., Ph.D. (IIT Madras)
DEPARTMENT OF MCA
SCHOOL OF COMPUTING SCIENCES
PALLAVARAM, CHENNAI – 600 117.
1. INTRODUCTION
This thesis presents methods for retrieving a video from a given database
using artificial neural network (ANN) techniques. The proposed algorithms
use features of the videos for training and testing the ANN. The features of
the frames of a video are extracted from their content in the form of text,
audio and images. In this work, multimodal refers to using more than one
type of information to retrieve a video. Applications of video retrieval
include news archives, which ingest large amounts of video daily that must
be archived and indexed.
2. REVIEW OF LITERATURE
Video will be one of the key issues in the upcoming Internet evolution in
infotainment and education. Converting raw video streams into thoroughly
structured, indexed, web-ready, database-driven information entities is a
must. Information databases have evolved from simple text to multimedia
with video, audio, and text. The query mechanism for such a video database
is similar in concept to that of a textual database: object searching is
analogous to word searching; scene browsing is similar to paragraph
searching; video indexing is comparable to text indexing or bookmarking.
The implementation of content-based video searching, however, is very
different from, and much more difficult than, the query mechanism for a
textual database. Scene change detection techniques can be applied for
scene browsing and for automatic, intelligent indexing of video sequences
in video databases. Once individual video scenes are identified,
content-based indexing mechanisms (such as indexing by object texture,
shape, color, motion, etc.) can be used to index and query the image
content of each video scene.
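Scene change detection of this kind is commonly implemented by comparing colour histograms of consecutive frames and declaring a cut when the distance between them is large. The Python sketch below illustrates the idea; the bin count and threshold are illustrative assumptions, not values taken from this work:

```python
import numpy as np

def shot_boundaries(frames, bins=16, threshold=0.4):
    """Detect shot boundaries by comparing normalised colour
    histograms of consecutive frames; a large histogram distance
    suggests a cut.

    frames: iterable of HxWx3 uint8 arrays (hypothetical frame list).
    Returns indices i where a boundary is declared between i-1 and i.
    """
    boundaries = []
    prev_hist = None
    for i, frame in enumerate(frames):
        # Per-channel histograms concatenated into one vector.
        hist = np.concatenate([
            np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
            for c in range(3)
        ]).astype(float)
        hist /= hist.sum()
        if prev_hist is not None:
            # L1 distance between consecutive histograms (range 0..2).
            if np.abs(hist - prev_hist).sum() > threshold:
                boundaries.append(i)
        prev_hist = hist
    return boundaries
```

A real system would smooth the distance signal or use an adaptive threshold to avoid false cuts on fast motion or lighting changes.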
A video stream carries many information channels: the visual content, the
narrative or speech track, and text captions. Visual content remains the
most important. From a long scene with little or no change, a single
representative image is preferred; this process is key frame extraction.
Two kinds of key frame extraction strategies have been developed and used
by various researchers. The simplest way is to select one or several frames
from each segmented shot. Some use the first frame of each shot as the
representative frame, i.e., the key frame of the shot. Others may use a
random frame, the last frame, or the middle frame as the prototype.
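These simple selection strategies can be written in a few lines. The helper below is a hypothetical illustration, not code from the thesis:

```python
def key_frame(shot_frames, strategy="middle"):
    """Pick one representative frame from a segmented shot.

    shot_frames: list of frames belonging to one shot.
    strategy: 'first', 'last', or 'middle', mirroring the common
    choices described above.
    """
    if not shot_frames:
        raise ValueError("empty shot")
    if strategy == "first":
        return shot_frames[0]
    if strategy == "last":
        return shot_frames[-1]
    if strategy == "middle":
        return shot_frames[len(shot_frames) // 2]
    raise ValueError(f"unknown strategy: {strategy}")
```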
A substantial body of earlier research addresses the retrieval of videos.
Web browsers and email service providers have developed many indexing
methods and retrieval algorithms for retrieving videos. In addition to
these existing developments, this work focuses on implementing intelligent
methods such as ANNs.
Erol and Kossentini (2000): object-based video representation, such as
that suggested by the MPEG-4 standard, offers a framework better suited to
object-based video indexing and retrieval. In such a framework, the
concept of a "key frame" is replaced by that of a "key video object
plane." The authors proposed a method for key video object plane selection
using the shape information in the MPEG-4 compressed domain. The shape of
the video object (VO) is approximated using the shape coding modes of I,
P, and B video object planes (VOPs), without decoding the shape
information in the MPEG-4 bit stream. Two popular shape distance measures,
the Hamming and Hausdorff distances, were modified to measure the
similarity between the approximated shapes of the video objects.
Approaches mainly differ in the set of acoustic features used to represent
the audio signal and in the classification technique applied (Tsekeridou
and Pitas, 2001). In speech research, features such as Mel-frequency
cepstral coefficients (MFCCs) or linear prediction coefficients (LPCs)
have been shown to provide good representations of a speech signal,
allowing for better discrimination than temporal or frequency-based
features alone (Holmes and Holmes, 2001). Liu and Wan (2001) conclude from
their study that cepstral features such as MFCCs perform better than
temporal or frequency-based features, and advocate their use for general
audio tasks, particularly when the number of audio classes is large.
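As a rough illustration of the MFCC pipeline (power spectrum, mel filterbank, log, DCT), a simplified NumPy sketch follows. The frame length, sample rate and filter counts are illustrative defaults; a production front end would add pre-emphasis, windowing and liftering:

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):          # rising slope
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling slope
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mfcc(frame, sr=16000, n_filters=26, n_ceps=13):
    """MFCCs of one audio frame: power spectrum -> mel filterbank
    -> log -> DCT-II, as used for speech features."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n_fft
    energies = mel_filterbank(n_filters, n_fft, sr) @ power
    log_e = np.log(energies + 1e-10)
    # DCT-II matrix over the log filterbank energies.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_filters))
    return dct @ log_e
```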
Semantic filtering and retrieval of multimedia content is crucial for
efficient use of multimedia data repositories (Naphade and Huang, 2001).
Slaney (2002) proposed a state-of-the-art system that incorporates a
mapping between audio and semantic spaces: methods are developed to
describe general audio with words (and also to predict sounds given a text
query) using a labelled sound set. Adams et al. (2003) focused on semantic
classification through the identification of meaningful intermediate-level
semantic components using both audio and video features.
Fang et al. (2006) proposed an indexing and retrieval system for visual
content based on features extracted from the compressed domain. Direct
processing of the compressed domain spares the decoding time, which is
extremely important when indexing large multimedia archives. A fuzzy
categorization structure was designed to improve retrieval performance. In
their experiment, a database consisting of basketball videos was
constructed, covering three categories: full-court match, penalty and
close-up. First, spatial and temporal feature extraction was applied to
train the fuzzy membership functions using a minimum-entropy optimization
algorithm. Then the max-composition operation was used to generate a new
fuzzy feature representing the content of the shots. Finally, the
fuzzy-based representation becomes the indexing feature of the
content-based video retrieval system. The experimental results showed the
proposed algorithm to be quite promising for semantic-based video
retrieval.
Benmokhtar and Huet (2006) proposed an improved radial basis function
(RBF) network based on evidence theory (NNET), with one input layer, two
hidden layers and one output layer, to improve classifier combination and
recognition reliability, in particular for automatic semantic-based video
content indexing and retrieval. Many combination schemes have been
proposed in the literature, differing in the type of information provided
by each classifier as well as in their training and adaptation abilities.
Snoek et al. (2007) identified three strategies for selecting a detector
relevant to a given query from a thesaurus, namely text matching, ontology
querying, and semantic visual querying. They evaluated the methods on the
automatic search task of the TRECVID 2005 video retrieval benchmark, using
a news video archive of 85 hours in combination with a thesaurus of 363
machine-learned concept detectors. They assessed the influence of
thesaurus size on video search performance, evaluated and compared the
multimodal selection strategies for concept detectors, and finally
discussed their combined potential using oracle fusion.
Lu et al. (2007) surveyed existing techniques and systems for
content-based indexing and retrieval of the two main types of multimedia
data, images and videos. For content-based image retrieval, they proposed
multi-scale color histograms incorporating color and spatial information.
For content-based video retrieval, they proposed a new "program" level
between the story/event and video-sequence levels, and presented video
summarization results based on scene clustering.
Gao et al. (2009) treated video sequences as temporal trajectories via
scaling and a lower-dimensional representation of the video frame
luminance field, and developed a video trajectory indexing and matching
scheme to support video clip search. Simulation results demonstrated that
the proposed approach achieves excellent performance in both response
speed and precision-recall accuracy.
Amel et al. (2010) studied the motion activity descriptor for shot
boundary detection in video sequences. The motion activity information was
extracted in the uncompressed domain using the adaptive rood pattern
search (ARPS) algorithm.
Hu et al. (2011) presented a tutorial and overview of general strategies
in visual content-based video indexing and retrieval, covering video
structure analysis (shot boundary detection, key frame extraction and
scene segmentation), feature extraction (static key frame features, object
features and motion features), video data mining, video annotation, video
retrieval (query interfaces, similarity measures and relevance feedback),
and video browsing.
Chen et al. (2012) analyzed semantic filtering and retrieval of multimedia
content for efficient use of multimedia data repositories. Video query by
semantic keywords is one of the most difficult problems in multimedia data
retrieval; the difficulty lies in the mapping between low-level video
representation and high-level semantics. They proposed a probabilistic
framework for semantic video indexing that supports filtering and
retrieval and facilitates efficient content-based access. To map low-level
features to high-level semantics, probabilistic multimedia objects
(multijects) are proposed; examples of multijects in movies include
explosion, mountain, beach, outdoor and music. Semantic concepts in videos
interact, and to model this interaction explicitly they proposed a network
of multijects (a multinet). Using probabilistic models for six site
multijects (rocks, sky, snow, water-body, forestry/greenery and outdoor)
and a Bayesian belief network as the multinet, they demonstrated the
application of this framework to semantic indexing.
3. OBJECTIVES
1. To reduce the number of features for the video indexing and retrieval
mechanism.
2. To extract unambiguous key frames that represent shots and scenes.
4. METHODOLOGIES
To index and retrieve videos using existing methods of feature extraction,
and to train and test:
1. Back Propagation Neural Network (BPA).
2. Radial Basis Function Neural Network (RBF).
3. Dynamic Time Warping (DTW) with BPA (DTWBPA).
4. Dynamic Time Warping (DTW) with RBF (DTWRBF).
4.1 Back propagation algorithm
The BPA uses the steepest-descent method to minimize the output error. The
number of layers and the number of nodes in the hidden layers are decided,
and the connections between nodes are initialized with random weights. A
pattern from the training set is presented at the input layer of the
network, and the error at the output layer is calculated. The error is
propagated backwards, towards the input layer, and the weights are
updated. This procedure is repeated for all the training patterns, which
forms one iteration. At the end of each iteration, test patterns are
presented to the ANN and its prediction performance is evaluated. Training
of the ANN continues until the desired prediction performance is reached.
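The procedure above can be sketched for a one-hidden-layer network with sigmoid activations. The layer sizes, learning rate and activation choice below are illustrative assumptions, not the configuration used in the thesis:

```python
import numpy as np

def train_bpa(X, Y, hidden=8, lr=0.5, iterations=2000, seed=0):
    """Minimal one-hidden-layer network trained with the
    steepest-descent back-propagation procedure: forward pass,
    output error, backward propagation, weight update.
    Returns a predict function closing over the trained weights."""
    rng = np.random.default_rng(seed)
    # Random initial weights: input->hidden and hidden->output.
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden))
    W2 = rng.normal(0.0, 0.5, (hidden, Y.shape[1]))
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(iterations):
        # Forward pass.
        H = sigmoid(X @ W1)
        O = sigmoid(H @ W2)
        # Error at the output layer, propagated backwards.
        dO = (O - Y) * O * (1 - O)
        dH = (dO @ W2.T) * H * (1 - H)
        # Steepest-descent weight updates.
        W2 -= lr * H.T @ dO
        W1 -= lr * X.T @ dH
    return lambda Xnew: sigmoid(sigmoid(Xnew @ W1) @ W2)
```

For example, trained on the four input patterns of the logical AND function, the network's mean squared error drops well below that of the randomly initialised network.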
4.2 Radial basis function
A radial basis function (RBF) is a real-valued function whose value
depends only on the distance from the origin: if a function h satisfies
the property h(x) = h(||x||), then it is a radial function. The
characteristic feature of radial functions is that their response
decreases (or increases) monotonically with distance from a central point.
The centre, the distance scale, and the precise shape of the radial
function are parameters of the model, all of which are fixed if the model
is linear.
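A minimal linear RBF model in this spirit uses Gaussian basis functions with fixed centres and width, so only the output weights are fitted, by least squares. All names and parameter choices below are illustrative:

```python
import numpy as np

def rbf_design_matrix(X, centres, width):
    """Gaussian radial basis responses: each entry depends only on
    the distance of a sample from a centre and decreases
    monotonically with that distance."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def fit_rbf(X, y, centres, width):
    """Linear RBF model: with the centres and width fixed, the
    output weights follow from a linear least-squares fit."""
    Phi = rbf_design_matrix(X, centres, width)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return lambda Xnew: rbf_design_matrix(Xnew, centres, width) @ w
```

With the training points themselves used as centres, the model interpolates a smooth target such as a sine curve almost exactly.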
4.3 Dynamic time warping
Dynamic Time Warping (DTW) is a dynamic-programming procedure used to
align sequences of feature vectors and compute a similarity measure. DTW
computes the optimal alignment between sequences of feature vectors such
that the cumulative distance, consisting of the local distances between
aligned frames of the signals, is minimized. The mapping is generally
non-linear, since the input signals may be of different lengths.
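The DTW recursion can be written directly from this description. The Euclidean local distance below is one common choice, not necessarily the one used in the thesis:

```python
import numpy as np

def dtw_distance(a, b):
    """Cumulative DTW cost between two sequences of feature
    vectors, minimising the sum of local distances over all
    monotonic alignments (classic dynamic-programming recursion)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    if a.ndim == 1:
        a = a[:, None]          # treat scalars as 1-D feature vectors
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local distance
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```

Because the alignment may repeat frames, a sequence with a duplicated sample still matches its original at zero cost, which is exactly the length-tolerance that makes DTW suitable for comparing speech utterances.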
5. RESULTS
Fig. 1 Schematic layout of video indexing and retrieval (blocks: video
segmentation, scene/shot database, content-based analysis and index
database, serving the end user through a query processor)

A sample video has been considered. The author reads out a message in the
video, and a few other people also converse in it.
Step 1: Frames are extracted from the video.
Step 2: Image features are computed: statistical measures of the colour
distribution; texture (statistical description + template matching); shape
(template matching); lines: horizontal, vertical, diagonal (template
matching); angles: 90, 45, 120 degrees (template matching); curves:
hyperbola, parabola, ellipse, circle (template matching); blobs (template
matching); edges (template matching).
Step 3: The audio track of the frames is extracted and spectrogram
features are obtained (matched using DTW).
Step 4: Text is extracted from a frame by segmentation of objects using
template matching.
Step 5: The features are labelled for recognition purposes.
Step 6: BPA/RBF are trained and tested using the features obtained in
Steps 2-5.
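The steps above amount to assembling one labelled feature vector per frame from the per-modality feature groups. A hypothetical sketch of that bookkeeping:

```python
import numpy as np

def build_training_set(feature_sets, labels):
    """Concatenate per-modality feature arrays (colour, texture,
    audio, text, ...) into one feature vector per frame, paired
    with its label for BPA/RBF training.

    feature_sets: list of (n_frames, n_features_i) arrays,
    one per modality (hypothetical shapes for illustration).
    """
    X = np.concatenate([np.atleast_2d(f) for f in feature_sets], axis=1)
    y = np.asarray(labels)
    assert len(X) == len(y), "one label per frame expected"
    return X, y
```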
The key frame identification procedure will be presented in the thesis.
The speeches of four persons were recorded in a normal environment. The
candidates (Table 1) are Prasanna, the author (Pra); Purushothaman (Pur);
Rajeswari (Raj); and Shwetha (Shwe). The statement read aloud was: "In
today's 2011 match between India and the West Indies, India won by six
wickets." While the statement was being read, video recording was done.
This short video is embedded at different frame locations of different
videos.
The statement was read twice, at different instances, by Prasanna and
stored as Pra1 and Pra2. Purushothaman (Pur1), Rajeswari (Raj1) and
Shwetha (Shwe) were each recorded once. The recordings are in 16-bit
stereo format. Table 2 shows the number of speech recordings at the two
instances.
Table 1 Speech Combination Matrix

Person name | Pra1 | Pra2 | Pur1 | Raj1 | Shwe
Pra1        |  √   |  √   |  √   |  √   |  √
Pra2        |  √   |  √   |  √   |  √   |  √
Pur1        |  √   |  √   |  √   |  √   |  √
Raj1        |  √   |  √   |  √   |  √   |  √
Shwe        |  √   |  √   |  √   |  √   |  √
Table 2 Recordings of speech at different instances

Person name   | Speech 1 | Speech 2
Prasanna      |    √     |    √
Purushothaman |    √     |
Rajeswari     |    √     |
Shwetha       |    √     |
Table 3 Plotting Combinations

Person name | Comparison of matching score | DTW error plot
Pra1        |              √               |       √
Pra2        |              √               |       √
Pur1        |              √               |       √
Raj1        |              √               |       √
Shwe        |              √               |       √
In row 1 of Table 3, Pra1 is kept as the reference available in the video,
and the speeches of Pra2, Pur1, Raj1 and Shwe are tested to retrieve Pra1.
Rows 2 to 5 proceed likewise: Pra2, Pur1, Raj1 and Shwe are each kept in
turn as the reference, and the remaining four speeches are tested to
retrieve it.
Fig. 2 Speech of four candidates (amplitude versus sample index for
Prasanna1, Prasanna2, Puru1, Rajeswari1 and Shwetha1)

Figure 2 shows the speech waveforms of all four candidates.
Fig. 3 DTW matching score (Prasanna1 vs Prasanna1, same recording)

Figure 3 presents the matching score of Prasanna's speech 1 against the
same speech, showing that the match is perfect.
Fig. 4 DTW matching score (Prasanna1 vs Prasanna2, different recordings)

If speech 2 of Prasanna is used to retrieve the Prasanna frame from the
videos, the matching score deviates, as shown in Figure 4.
Fig. 5 DTW matching score (Prasanna1 vs Puru1, different recordings)

If speech 1 of Purushothaman is used to retrieve the Prasanna frame from
the videos, the matching score deviates, as shown in Figure 5.
Fig. 6 DTW matching score (Prasanna1 vs Rajeswari1, different recordings)

If speech 1 of Rajeswari is used to retrieve the Prasanna frame from the
videos, the matching score deviates, as shown in Figure 6.
Fig. 7 DTW matching score (Shwetha1 vs Prasanna1, different recordings)

If speech 1 of Shwetha is used to retrieve the Prasanna frame from the
videos, the matching score deviates, as shown in Figure 7.
A. Retrieval of Prasanna with Prasanna1 speech versus other candidates

Fig. 8 Comparison of matching scores (Prasanna1 against Prasanna1,
Prasanna2, Puru1, Raj1 and Shwe)

Figure 8 presents the matching scores of all four candidates. The blue
line indicates a perfect match when speech 1 of Prasanna is used for frame
retrieval; the matching scores deviate when the other candidates' speeches
are used to retrieve speech 1 of Prasanna.
Fig. 9 DTW error plot (Prasanna1 against all candidates)

Figure 9 presents the deviations in the speech of all four candidates when
compared with the first candidate's speech. The X-axis represents frames
of 512 samples each, and the Y-axis represents the deviation relative to
the reference value 0. The deviation is large for Prasanna vs Rajeswari
and Prasanna vs Shwetha; the speech utterances of Rajeswari and Shwetha
cannot be used to retrieve the frames of Prasanna, as the match is not
within the limit. The speech of Purushothaman, however, can be used to
retrieve the frames of Prasanna.
B. Retrieval of Prasanna with Prasanna2 speech versus other candidates

Figure 10 presents the matching scores of all four candidates. The green
line indicates a perfect match when speech 2 of Prasanna is used for frame
retrieval; the matching scores deviate when the other candidates' speeches
are used to retrieve speech 2 of Prasanna.

Figure 11 presents the deviations in the speech of all four candidates
when compared with the Prasanna2 speech.
Fig. 10 Comparison of matching scores for Prasanna2 against the other
candidates

Fig. 11 DTW error plot for Prasanna2 and the other candidates
6. CONCLUSION
In this thesis, 100 videos were downloaded from internet resources. The
contents of the videos cover a wide range of topics. Features
corresponding to image, text and audio are extracted using standard
statistical methods, and the features are labelled to train the ANN
algorithms. We claim the following:
1. BPA takes more time to learn than RBF.
2. The retrieval process for a given query is quick, as the topology of
the ANN is minimal and requires less memory.
THESIS LAYOUT
1. Introduction and scope of the work
2. Video data generation
3. Feature extraction
4. Algorithm implementation for video indexing and retrieval
   4.1 Back propagation algorithm
   4.2 Radial basis function
   4.3 Dynamic time warping
5. Results and discussion
6. Conclusion
LIST OF PUBLICATIONS
INTERNATIONAL JOURNALS
1. Prasanna S., Purushothaman S. and Jothi A., 2010, Video Indexing Using
   Radial Basis Function Network, Technology Today (ISSN 2180-0987),
   Vol. II, Issue II, pp. 395-402.
2. Prasanna S., Purushothaman S. and Jothi A., 2011, Implementation of
   Dynamic Time Warping for Video Indexing, International Journal of
   Computer Science and Information Security (IJCSIS), Vol. 9, No. 7,
   pp. 129-134.
INTERNATIONAL CONFERENCE
1. Prasanna S. and Purushothaman S., "Importance of Semantic Web in Video
   Mining", Proceedings of the International Conference on Intelligent
   Science and Technology, Sun College of Engineering and Technology,
   Nagerkoil, Kanyakumari District, Tamilnadu, India, 18-19 June 2009,
   pp. 242-247.
2. Prasanna S. and Purushothaman S., "Video Indexing Using Multimodal
   Approach", International Conference Proceedings on Innovations in
   Contemporary IT Research, Quaid-E-Millath Government College for
   Women, Anna Salai, Chennai, 17-18 February 2012, pp.
NATIONAL CONFERENCE
1. Prasanna S. and Purushothaman S., "Content Based Shot Indexing in
   Video Retrieval", National Conference on Information and Software
   Engineering (NCISE 2012), Aarupadai Veedu Institute of Technology,
   Vinayaka Missions University, 9-10 March 2012, pp.
REFERENCES
[1] Abdelati Malek Amel, Ben Abdelali Abdessalem and Mtibaa Abdellatif,
2010, Video Shot Boundary Detection Using Motion Activity Descriptor,
Journal of Telecommunications, Vol. 2, Issue 1, pp. 54-59.
[2] Adams W.H., Iyengar G., Lin C.Y., Naphade M. R., Neti C., Nock H. J.,
and Smith J.R., 2003, Semantic Indexing of Multimedia using Audio,
Text and Visual Cues, EURASIP J. Appl. Signal Processing, pp. 170-
185.
[3] Berna Erol and Faouzi Kossentini, 2000, Automatic Key Video Object
Plane Selection Using the Shape Information in the MPEG-4 Compressed
Domain, IEEE Transactions on Multimedia, Vol. 2, No. 2, pp. 129-138.
[4] Cees G. M. Snoek, Bouke Huurnink, Laura Hollink, Maarten De Rijke,
Guus Schreiber and Marcel Worring, 2007, Adding Semantics To
Detectors For Video Retrieval, IEEE Transactions on Multimedia, Vol. 9,
No. 5, pp. 975-986.
[5] Holmes J. N. and Holmes W. J., 2001, Speech synthesis and
recognition, Taylor & Francis, second edition.
[6] Hong Lu, Xiangyang Xue and Yap-Peng Tan, 2007, Content-Based
Image and Video Indexing and Retrieval, Springer–Verlag, pp. 118-
129.
[7] Hui Fang, Rami Qahwaji, and Jianmin Jiang, 2006, Video Indexing and
Retrieval in Compressed Domain Using Fuzzy-Categorization, Springer-
Verlag Berlin Heidelberg, pp. 1143–1150.
[8] Li Gao, Zhu Li and Aggelos Katsaggelos, 2009, An Efficient Video
Indexing and Retrieval Algorithm Using The Luminance Field
Trajectory Modeling, IEEE Transactions on Circuits and Systems for
Video Technology, Vol. 19, No. 10, pp. 1566-1570.
[9] Liu M. and Wan C., 2001, A Study on Content-Based Classification and
Retrieval of Audio Database, in Proc. International Database Engineering
and Applications Symposium, pp. 339-345.
[10] Milind Ramesh Naphade and Thomas S. Huang, 2001, A Probabilistic
Framework For Semantic Video Indexing, Filtering and Retrieval, IEEE
Transactions On Multimedia, Vol. 3, No. 1, pp. 141-151.
[11] Rachid Benmokhtar and Benoit Huet, 2006, Neural Network Combining
Classifier Based on Dempster-Shafer Theory, Proceedings of the
International Multi-Conference on Computer Science and Information
Technology, pp. 3-10.
[12] Slaney M., 2002, Mixtures of Probability Experts for Audio Retrieval
and Indexing, in Proc. IEEE International Conference on Multimedia and
Expo, Vol. 1, pp. 345-348.
[13] Slaney M., 2002, Semantic-Audio Retrieval, in Proc. IEEE
International Conference on Acoustics, Speech and Signal Processing,
Vol. 4, pp. IV4108-11.
[14] Sofia Tsekeridou and Ioannis Pitas, 2001, Content-Based Video
Parsing and Indexing Based on Audio-Visual Interaction, IEEE Transactions
on Circuits and Systems for Video Technology, Vol. 11, No. 4,
pp. 522-535.
[15] Weiming Hu, Nianhua Xie, Li Li, Xianglin Zeng and Stephen
Maybank, 2011, A Survey on Visual Content-Based Video Indexing and
Retrieval, IEEE Transactions on Systems, Man and Cybernetics—Part
C: Applications And Reviews, Vol. 41, No. 6, pp. 797-819.
[16] Xu Chen, Alfred O. Hero, and Silvio Savarese, 2012, Multimodal Video
Indexing and Retrieval Using Directed Information, IEEE Transactions on
Multimedia, Vol. 14, No. 1, pp. 1-14.