
INTELLIGENT MULTIMODEL CONTENT BASED VIDEO

RETRIEVAL

Synopsis

Submitted by

PRASANNA S.

In partial fulfillment for the award of the degree

of

DOCTOR OF PHILOSOPHY

Under the guidance of

Dr. Purushothaman S.

BE, ME, PhD (IIT Madras)

DEPARTMENT OF MCA

SCHOOL OF COMPUTING SCIENCES

PALLAVARAM, CHENNAI – 600 117.


1. INTRODUCTION

This thesis presents retrieving a video from a given database using

artificial neural network (ANN) methods. The proposed algorithms use

features of the videos for training and testing ANN. The features of frames of

a video are extracted from the contents in the form of text, audio and image

in the frames. In this work, multimodel refers to using more than one

information to retrieve a video. Applications of video retrieval include the

News archives which use large amounts of video daily and must be archived.

2. REVIEW OF LITERATURE

Video will be one of the key issues in the upcoming evolution of the Internet for infotainment and education. Converting raw video streams into thoroughly structured and indexed, web-ready, database-driven information entities is a must. Information databases have evolved from simple text to multimedia with video, audio and text. The query mechanism for such a video database is similar in concept to that of a textual database: object searching is analogous to word searching, scene browsing is similar to paragraph searching, and video indexing is comparable to text indexing or bookmarking. The implementation of video content-based searching, however, is very different from and much more difficult than the query mechanism for a textual database. Scene change detection can often be applied for scene browsing and for automatic, intelligent indexing of video sequences in video databases. Once individual video scenes are identified, content-based indexing mechanisms (such as indexing by object texture, shape, color, motion, etc.) can be used to index and query the image content of each scene.

There are several information carriers in a video stream: the visual content, the narrative or speech track, and text captions. The visual content remains the most important. A representative image is preferred from a long scene with little or no change; this process is key frame extraction. Two kinds of key frame extraction strategies have been developed and used by various researchers. The simplest way is to select one or several frames from each segmented shot. Some use the first frame of each shot as the representative frame, i.e., the key frame of the shot. Others may use a random frame, the last frame or the middle frame as the prototypical frame, as the sketch below illustrates.
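
A minimal sketch of these shot-based strategies, assuming shot boundaries are already available from a scene change detector; the boundary values in the example are illustrative, not data from the thesis:

    # Pick the first, middle, or last frame of each shot as its key frame.
    def key_frames(shot_boundaries, strategy="middle"):
        """shot_boundaries: list of (start, end) frame indices, end inclusive."""
        picks = []
        for start, end in shot_boundaries:
            if strategy == "first":
                picks.append(start)
            elif strategy == "last":
                picks.append(end)
            else:  # "middle"
                picks.append((start + end) // 2)
        return picks

    # Example: three shots reported by a (hypothetical) scene change detector
    print(key_frames([(0, 120), (121, 300), (301, 450)]))  # [60, 210, 375]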

A large body of research on video retrieval already exists. Web browsers and email service providers have developed many indexing methods and retrieval algorithms for retrieving videos. In addition to these existing developments, this work focuses on implementing intelligent methods such as ANN.

Berna Erol et al., 2000: object-based video representation, such as the one suggested by the MPEG-4 standard, offers a framework that is better suited for object-based video indexing and retrieval. In such a framework, the concept of a "key frame" is replaced by that of a "key video object plane." The authors proposed a method for key video object plane selection using the shape information in the MPEG-4 compressed domain. The shape of the video object (VO) is approximated using the shape coding modes of I, P and B video object planes (VOPs) without decoding the shape information in the MPEG-4 bit stream. Two popular shape distance measures, the Hamming and Hausdorff distances, were modified to measure the similarities between the approximated shapes of the video objects.

Approaches mainly differ in the set of acoustic features used to represent the audio signal and the classification technique applied (Sofia et al., 2001). In speech research, features such as Mel frequency cepstral coefficients (MFCCs) or linear prediction coefficients (LPCs) have been demonstrated to provide good representations of a speech signal, allowing for better discrimination than temporal or frequency-based features alone (Holmes et al., 2001). Liu et al., 2001, conclude from their study that cepstral features such as MFCCs perform better than temporal or frequency-based features, and advocate their use for general audio tasks, particularly when the number of audio classes is large.

Semantic filtering and retrieval of multimedia content is crucial for efficient use of multimedia data repositories (Milind Ramesh Naphade et al., 2001). Slaney, 2002, proposed a state-of-the-art system that incorporates a mapping between audio and semantic spaces. Methods are developed to describe general audio with words (and also to predict sounds given a text query) using a labeled sound set. Adams et al., 2003, focused on semantic classification through the identification of meaningful intermediate-level semantic components using both audio and video features.

Hui Fang et al., 2006, proposed an indexing and retrieval system for visual content based on features extracted in the compressed domain. Direct processing of the compressed domain avoids decoding time, which is extremely important when indexing large multimedia archives. A fuzzy categorization structure was designed to improve retrieval performance. In their experiment, a database of basketball videos was constructed, with three categories: full court match, penalty and close-up. First, spatial and temporal feature extraction was applied to train the fuzzy membership functions using a minimum entropy optimization algorithm. Then, the max composition operation was used to generate a new fuzzy feature to represent the content of the shots. Finally, the fuzzy-based representation becomes the indexing feature for the content-based video retrieval system. The experimental results showed that the proposed algorithm is quite promising for semantic-based video retrieval.


Rachid Benmokhtar et al., 2006, proposed an improved version of a radial basis function (RBF) network based on evidence theory (NNET), using one input layer, two hidden layers and one output layer, to improve classifier combination and recognition reliability, in particular for automatic semantic video content indexing and retrieval. Many combination schemes have been proposed in the literature according to the type of information provided by each classifier as well as their training and adaptation abilities.

Cees et al., 2007, identified three strategies for selecting a relevant detector from a thesaurus for a given query: text matching, ontology querying, and semantic visual querying. They evaluated the methods against the automatic search task of the TRECVID 2005 video retrieval benchmark, using a news video archive of 85 hours in combination with a thesaurus of 363 machine-learned concept detectors. They assessed the influence of thesaurus size on video search performance, evaluated and compared the multimodal selection strategies for concept detectors, and finally discussed their combined potential using oracle fusion.

Hong Lu et al., 2007, surveyed existing techniques and systems for content-based indexing and retrieval of the two main types of multimedia data: images and videos. For content-based image retrieval, they proposed multi-scale color histograms incorporating color and spatial information. For content-based video retrieval, they proposed a new program level between the story/event and video sequence levels. Video summarization results are given based on the scene clustering results.

Li Gao et al., 2009, treated video sequences as temporal trajectories via scaling and a lower-dimensional representation of the video frame luminance field, and developed a video trajectory indexing and matching scheme to support video clip search. Simulation results demonstrated that the proposed approach achieved excellent performance in both response speed and precision-recall accuracy.

Abdelati Malek Amel et al., 2010, studied the motion activity descriptor for shot boundary detection in video sequences. The motion activity information was extracted in the uncompressed domain using the adaptive rood pattern search (ARPS) algorithm.

Weiming Hu et al., 2011, presented a tutorial and overview of the landscape of general strategies in visual content-based video indexing and retrieval, focusing on methods for video structure analysis (including shot boundary detection, key frame extraction and scene segmentation), feature extraction (including static key frame features, object features and motion features), video data mining, video annotation, video retrieval (including query interfaces, similarity measures and relevance feedback), and video browsing.

Xu Chen et al., 2012, analyzed semantic filtering and retrieval of multimedia content for efficient use of multimedia data repositories. Video query by semantic keywords is one of the most difficult problems in multimedia data retrieval; the difficulty lies in the mapping between low-level video representation and high-level semantics. They proposed a probabilistic framework for semantic video indexing that can support filtering and retrieval and facilitate efficient content-based access. To map low-level features to high-level semantics, they proposed probabilistic multimedia objects (multijects); examples of multijects in movies include explosion, mountain, beach, outdoor and music. Semantic concepts in videos interact, and to model this interaction explicitly they proposed a network of multijects (a multinet), using probabilistic models for six site multijects (rocks, sky, snow, water-body, forestry/greenery and outdoor) and a Bayesian belief network as the multinet. They demonstrated the application of this framework to semantic indexing.


3. OBJECTIVES

1. To reduce the number of features for the video indexing and retrieval mechanism.

2. To extract unambiguous key frames that represent shots and scenes.

4. METHODOLOGIES

To index and retrieve video using existing methods of feature extraction, and to train and test:

1. Back Propagation Neural Network (BPA).
2. Radial Basis Function Neural Network (RBF).
3. Dynamic Time Warping (DTW) with BPA (DTWBPA).
4. Dynamic Time Warping (DTW) with RBF (DTWRBF).

4.1 Back propagation algorithm

The BPA uses the steepest-descent method to minimize the network error. The number of layers and the number of nodes in the hidden layers are decided, and the connections between nodes are initialized with random weights. A pattern from the training set is presented at the input layer of the network, and the error at the output layer is calculated. The error is propagated backwards towards the input layer, and the weights are updated. This procedure is repeated for all the training patterns and forms one iteration. At the end of each iteration, test patterns are presented to the ANN and its prediction performance is evaluated. Training continues until the desired prediction performance is reached, as in the sketch below.
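
A minimal sketch of this training cycle, assuming one hidden layer, sigmoid activations and illustrative layer sizes, learning rate and data; none of these values come from the synopsis:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((10, 8))           # 10 training patterns, 8 features each
    T = rng.random((10, 1))           # target outputs

    W1 = rng.standard_normal((8, 5))  # random initial weights, input -> hidden
    W2 = rng.standard_normal((5, 1))  # hidden -> output
    lr = 0.1
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for iteration in range(1000):     # one pass over all patterns = one iteration
        H = sigmoid(X @ W1)           # forward pass
        Y = sigmoid(H @ W2)
        E = Y - T                     # error at the output layer
        # propagate the error backwards and update the weights
        dY = E * Y * (1 - Y)
        dH = (dY @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dY
        W1 -= lr * X.T @ dH
        if np.mean(E ** 2) < 1e-3:    # stop when desired performance is reached
            break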

4.2 Radial basis function

A radial basis function (RBF) is a real-valued function whose value depends only on the distance from a central point, typically the origin. If a function h satisfies the property h(x) = h(||x||), then it is a radial function. The characteristic feature of radial functions is that their response decreases (or increases) monotonically with distance from the central point. The centre, the distance scale, and the precise shape of the radial function are parameters of the model, all of which are fixed if the model is linear.
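
As an illustration, a Gaussian radial basis function and a small fixed linear combination of such units might look as follows; the centres, width and output weights are illustrative assumptions, not values from the thesis:

    import numpy as np

    def gaussian_rbf(x, centre, width):
        r = np.linalg.norm(x - centre)     # value depends only on the distance
        return np.exp(-(r / width) ** 2)   # decreases monotonically with r

    # An RBF network output is a linear combination of such units
    centres = np.array([[0.0, 0.0], [1.0, 1.0]])  # assumed centres
    weights = np.array([0.7, 0.3])                # assumed linear weights

    def rbf_net(x, width=0.5):
        return sum(w * gaussian_rbf(x, c, width) for w, c in zip(weights, centres))

    print(rbf_net(np.array([0.2, 0.1])))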

4.3 Dynamic time warping

Dynamic Time Warping (DTW) is a dynamic-programming procedure used to align sequences of feature vectors and compute a similarity measure. DTW computes the optimal alignment between sequences of feature vectors such that the cumulative distance, consisting of the local distances between aligned elements of the signals, is minimized. The mapping is in general non-linear when the input signals are of different lengths.
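
A minimal sketch of this dynamic-programming recurrence, assuming Euclidean local distances between feature vectors (the synopsis does not specify the local distance):

    import numpy as np

    def dtw_distance(a, b):
        """a: (n, d) and b: (m, d) sequences of feature vectors."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)  # cumulative distance table
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local distance
                # extend the cheapest of the three allowed alignment moves
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    # Sequences of different lengths (50 vs 60 frames of 12-dim features)
    rng = np.random.default_rng(1)
    print(dtw_distance(rng.random((50, 12)), rng.random((60, 12))))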

5. RESULTS

Fig. 1 Schematic layout of video indexing and retrieval (blocks: video segmentation, scene/shot database, content-based analysis, index database, query processor, end user)

A sample video has been considered. The author reads out a message in the video, and a few other people also converse in it.


Step 1: Frames are extracted from the video.

Step 2: Visual features are extracted from each frame: statistical measures of the color distribution; texture (statistical description and template matching); shape (template matching); horizontal, vertical and diagonal lines (template matching); angles of 90, 45 and 120 degrees (template matching); curves such as the hyperbola, parabola, ellipse and circle (template matching); blobs (template matching); and edges (template matching).

Step 3: The audio track of the frames is extracted and spectrogram features are obtained (these are compared using DTW).

Step 4: Text is extracted from a frame by segmentation of objects using template matching.

Step 5: The features are labeled for recognition purposes.

Step 6: BPA/RBF are trained and tested using the features obtained in Steps 2 to 5. A sketch of the feature extraction of Steps 2 and 3 is given below.
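
A minimal sketch of the kind of per-frame color statistics (Step 2) and audio spectrogram (Step 3) described above; the frame sizes, window and data are illustrative assumptions, and the remaining Step 2 features (texture, shape, lines, angles, curves, blobs, edges) are omitted:

    import numpy as np

    def color_features(frame):
        """frame: (H, W, 3) RGB array -> per-channel mean and standard deviation."""
        return np.concatenate([frame.mean(axis=(0, 1)), frame.std(axis=(0, 1))])

    def spectrogram(audio, frame_len=512):
        """Magnitude spectra of non-overlapping 512-sample audio frames."""
        n_frames = len(audio) // frame_len
        frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
        return np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))

    rng = np.random.default_rng(2)
    print(color_features(rng.random((240, 320, 3))).shape)  # (6,)
    print(spectrogram(rng.standard_normal(16000)).shape)    # (31, 257)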

The key frame identification procedure will be presented in the thesis. The speech of four persons was recorded in a normal environment. The candidates (Table 1) are Prasanna, the author (Pra); Purushothaman (Pur); Rajeswari (Raj); and Shwetha (Shwe). The statement read aloud was: "In today's 2011 match between India and the West Indies, India won by six wickets." While the statement was being read, video recording was done. This short video is embedded at different frame locations of different videos.

The statement was read twice at different instances by Prasanna and stored as Pra1 and Pra2. The recordings of Purushothaman (Pur1), Rajeswari (Raj1) and Shwetha (Shwe) were also made. The recordings are in 16-bit stereo format. Table 2 shows the number of speech recordings made at the two different instances.

Table 1 Speech Combination Matrix

Person name   Pra1   Pra2   Pur1   Raj1   Shwe
Pra1           √      √      √      √      √
Pra2           √      √      √      √      √
Pur1           √      √      √      √      √
Raj1           √      √      √      √      √
Shwe           √      √      √      √      √

Table 2 Recordings of speech at different instances

Person name      Speech1   Speech2
Prasanna            √         √
Purushothaman       √
Rajeswari           √
Shwetha             √

Table 3 Plotting Combinations

Person name   Comparison of matching score   DTW error plot
Pra1                       √                        √
Pra2                       √                        √
Pur1                       √                        √
Raj1                       √                        √
Shwe                       √                        √

In Row 1 of Table 3, Pra1 is kept as the reference available in the video. The speech of Pra2, Pur1, Raj1 and Shwe will be tested to retrieve Pra1.

In Row 2 of Table 3, Pra2 is kept as the reference available in the video. The speech of Pra1, Pur1, Raj1 and Shwe will be tested to retrieve Pra2.

In Row 3 of Table 3, Pur1 is kept as the reference available in the video. The speech of Pra1, Pra2, Raj1 and Shwe will be tested to retrieve Pur1.

In Row 4 of Table 3, Raj1 is kept as the reference available in the video. The speech of Pra1, Pra2, Pur1 and Shwe will be tested to retrieve Raj1.

In Row 5 of Table 3, Shwe is kept as the reference available in the video. The speech of Pra1, Pra2, Pur1 and Raj1 will be tested to retrieve Shwe.

[Figure: waveform plots of the five recordings (Prasanna1, Prasanna2, Puru1, Rajeswari1, Shwetha1); x-axis: samples; y-axis: amplitude]

Fig. 2 Speech of four candidates

Figure 2 shows the plots of the speeches of all four candidates.


[Figure: DTW alignment plot; panel title: Prasanna vs Prasanna, same recording]

Fig. 3 DTW matching score (Pra1-Pra1)

Figure 3 presents the matching score of Prasanna's speech 1 against the same speech, showing that the match is perfect.

[Figure: DTW alignment plot; panel title: Prasanna1 vs Prasanna2, different recording]

Fig. 4 DTW matching score

If speech 2 of Prasanna is used to retrieve the Prasanna frame from the videos, then the matching score deviates, as shown in Figure 4.

[Figure: DTW alignment plot; panel title: Prasanna1 vs Puru1, different recording]

Fig. 5 DTW matching score

If speech 1 of Purushothaman is used to retrieve the Prasanna frame from the videos, then the matching score deviates, as shown in Figure 5.

[Figure: DTW alignment plot; panel title: Prasanna1 vs Rajeswari1, different recording]

Fig. 6 DTW matching score

If speech 1 of Rajeswari is used to retrieve the Prasanna frame from the videos, then the matching score deviates, as shown in Figure 6.

[Figure: DTW alignment plot; panel title: Shwetha1 vs Prasanna1, different recording]

Fig. 7 DTW matching score

If speech 1 of Shwetha is used to retrieve the Prasanna frame from the videos, then the matching score deviates, as shown in Figure 7.

A. Retrieval of Prasanna with Prasanna1 speech versus other candidates

[Figure: matching-score curves; x-axis: WaveFile-1; y-axis: WaveFile-2; legend: Prasanna1-Prasanna1, Prasanna1-Prasanna2, Prasanna1-Puru1, Prasanna1-Raj1, Prasanna1-Shwe]

Fig. 8 Comparisons of matching score

Figure 8 presents the matching scores of all four candidates. The blue line indicates a perfect match when speech 1 of Prasanna is used for frame retrieval. The matching scores deviate when the other three persons' speeches are used to retrieve speech 1 of Prasanna.

[Figure: error curves; x-axis: frames; y-axis: error; legend: Prasanna1-Prasanna1, Prasanna1-Prasanna2, Prasanna1-Puru1, Prasanna1-Raj1, Prasanna1-Shwe]

Fig. 9 DTW Error plot

Figure 9 presents the amount of deviation in the speech of each of the four candidates when compared with the first candidate's speech. The x-axis represents frames of 512 samples each, and the y-axis represents the amount of deviation relative to the reference value 0. There is a large deviation for Prasanna-Rajeswari and Prasanna-Shwetha. The speech utterances of Rajeswari and Shwetha cannot be used to retrieve the frames of Prasanna, as the match is not within the limit. However, the speech of Purushothaman can be used to retrieve the frames of Prasanna. A thresholding rule of this kind is sketched below.
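
The notion of a match being "within the limit" can be made concrete as a thresholding rule on the DTW error; the rule and the threshold value below are illustrative assumptions, not figures from the thesis:

    import numpy as np

    def retrievable(dtw_errors, limit=100.0):
        """dtw_errors: per-frame DTW error values for a query-reference pair.
        Accept the query for retrieval only if the mean absolute error is
        within the (assumed) limit."""
        return float(np.mean(np.abs(dtw_errors))) <= limit

    # Errors near 0 would pass; persistently large errors would fail.
    print(retrievable(np.array([5.0, -3.0, 8.0])))      # True
    print(retrievable(np.array([250.0, 300.0, 280.0]))) # False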

B. Retrieval of Prasanna with Prasanna2 speech versus other candidates

Figure 10 presents the matching scores of all four candidates. The green line indicates a perfect match when speech 2 of Prasanna is used for frame retrieval. The matching scores deviate when the other three persons' speeches are used to retrieve speech 2 of Prasanna.

Figure 11 presents the amount of deviation in the speech of all four candidates when compared with the Prasanna2 speech.


[Figure: matching-score curves; x-axis: WaveFile-1; y-axis: WaveFile-2; legend: Prasanna2-Prasanna1, Prasanna2-Prasanna2, Prasanna2-Puru1, Prasanna2-Raj1, Prasanna2-Shwe]

Fig. 10 Comparisons of matching score for Prasanna2 and other candidates

Fig. 11 DTW Error plot for Prasanna2 and other candidates

6. CONCLUSION

In this thesis, 100 videos have been downloaded from Internet resources. The contents of the videos cover a wide range of topics. Features corresponding to image, text and audio are extracted using standard statistical methods. The features are labeled in order to train the ANN algorithms. We claim the following:

1. BPA takes more time to learn than RBF.

2. The retrieval process for a given query is quick, as the topology of the ANN is minimal and requires little memory.


THESIS LAYOUT

1. Introduction and scope of the work
2. Video data generation
3. Feature extraction
4. Algorithm implementation for video indexing and retrieval
   4.1 Back propagation algorithm
   4.2 Radial basis function
   4.3 Dynamic time warping
5. Results and discussions
6. Conclusion

LIST OF PUBLICATIONS

INTERNATIONAL JOURNALS

1. Prasanna S., Dr. Purushothaman S. and Dr. Jothi A., 2010, Video Indexing using Radial Basis Function Network, Technology Today (ISSN 2180-0987), Vol. II, Issue II, pp. 395-402.

2. Prasanna S., Dr. Purushothaman S. and Dr. Jothi A., 2011, Implementation of Dynamic Time Warping for Video Indexing, International Journal of Computer Science and Information Security (IJCSIS), Vol. 9, No. 7, pp. 129-134.

INTERNATIONAL CONFERENCE

1. Prasanna S. and Dr. Purushothaman S., "Importance of Semantic Web in Video Mining", Proceedings of the International Conference on Intelligent Science and Technology, Sun College of Engineering and Technology, Nagerkoil, Kanyakumari District, Tamilnadu, India, 18-19 June 2009, pp. 242-247.

2. Prasanna S. and Dr. Purushothaman S., "Video Indexing Using Multimodal Approach", International Conference Proceedings on Innovations in Contemporary IT Research, Quaid-E-Millath Government College for Women, Anna Salai, Chennai, 17-18 February 2012, pp.

NATIONAL CONFERENCE

1. Prasanna S. and Dr. Purushothaman S., "Content Based Shot Indexing In Video Reterival", National Conference on Information and Software Engineering (NCISE 2012), Aarupadai Veedu Institute of Technology, Vinayaka Missions University, 9-10 March 2012, pp.

REFERENCES

[1] Abdelati Malek Amel, Ben Abdelali Abdessalem and Mtibaa Abdellatif, 2010, Video Shot Boundary Detection Using Motion Activity Descriptor, Journal of Telecommunications, Vol. 2, Issue 1, pp. 54-59.

[2] Adams W.H., Iyengar G., Lin C.Y., Naphade M.R., Neti C., Nock H.J. and Smith J.R., 2003, Semantic Indexing of Multimedia using Audio, Text and Visual Cues, EURASIP J. Appl. Signal Processing, pp. 170-185.

[3] Berna Erol and Faouzi Kossentini, 2000, Automatic Key Video Object Plane Selection Using the Shape Information in the MPEG-4 Compressed Domain, IEEE Transactions on Multimedia, Vol. 2, No. 2, pp. 129-138.

[4] Cees G.M. Snoek, Bouke Huurnink, Laura Hollink, Maarten de Rijke, Guus Schreiber and Marcel Worring, 2007, Adding Semantics to Detectors for Video Retrieval, IEEE Transactions on Multimedia, Vol. 9, No. 5, pp. 975-986.

[5] Holmes J.N. and Holmes W.J., 2001, Speech Synthesis and Recognition, Taylor & Francis, second edition.

[6] Hong Lu, Xiangyang Xue and Yap-Peng Tan, 2007, Content-Based Image and Video Indexing and Retrieval, Springer-Verlag, pp. 118-129.

[7] Hui Fang, Rami Qahwaji and Jianmin Jiang, 2006, Video Indexing and Retrieval in Compressed Domain Using Fuzzy-Categorization, Springer-Verlag Berlin Heidelberg, pp. 1143-1150.

[8] Li Gao, Zhu Li and Aggelos Katsaggelos, 2009, An Efficient Video Indexing and Retrieval Algorithm Using the Luminance Field Trajectory Modeling, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 19, No. 10, pp. 1566-1570.

[9] Liu M. and Wan C., 2001, A Study on Content-Based Classification and Retrieval of Audio Database, in Proc. International Database Engineering and Applications Symposium, pp. 339-345.

[10] Milind Ramesh Naphade and Thomas S. Huang, 2001, A Probabilistic Framework for Semantic Video Indexing, Filtering and Retrieval, IEEE Transactions on Multimedia, Vol. 3, No. 1, pp. 141-151.

[11] Rachid Benmokhtar and Benoit Huet, 2006, Neural Network Combining Classifier Based on Dempster-Shafer Theory, Proceedings of the International Multi-Conference on Computer Science and Information Technology, pp. 3-10.

[12] Slaney M., 2002, Mixtures of Probability Experts for Audio Retrieval and Indexing, in Proc. IEEE International Conference on Multimedia and Expo, Vol. 1, pp. 345-348.

[13] Slaney M., 2002, Semantic-Audio Retrieval, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 4, pp. IV-4108-11.

[14] Sofia Tsekeridou and Ioannis Pitas, 2004, Content-Based Video Parsing and Indexing Based on Audio-Visual Interaction, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 4, pp. 522-535.

[15] Weiming Hu, Nianhua Xie, Li Li, Xianglin Zeng and Stephen Maybank, 2011, A Survey on Visual Content-Based Video Indexing and Retrieval, IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, Vol. 41, No. 6, pp. 797-819.

[16] Xu Chen, Alfred O. Hero and Silvio Savarese, 2012, Multimodal Video Indexing and Retrieval Using Directed Information, IEEE Transactions on Multimedia, Vol. 14, No. 1, pp. 1-14.