
Expert Systems with Applications 37 (2010) 5068–5085

Effective content-based video retrieval using pattern-indexing and matching techniques

Ja-Hwung Su, Yu-Ting Huang, Hsin-Ho Yeh, Vincent S. Tseng *

Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan, ROC

Article info

Keywords: Content-based video retrieval; Temporal pattern; Sequence matching; Pattern-based search; Fast-pattern-index tree

0957-4174/$ - see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2009.12.003

* Corresponding author. Tel.: +886 6 2757575; fax: +886 6 2747076. E-mail address: [email protected] (V.S. Tseng).

Abstract

Recently, multimedia data has grown rapidly due to advanced multimedia capturing devices such as digital video recorders, mobile cameras and so on. Since conventional query-by-text retrieval cannot satisfy users' requirements in finding the desired videos effectively, content-based video retrieval is regarded as one of the most practical solutions for improving retrieval quality. In addition, video retrieval using query-by-image is not successful in associating videos with the user's interest either. In this paper, we propose an innovative method that achieves high-quality content-based video retrieval by discovering the temporal patterns in video contents. On the basis of the discovered temporal patterns, an efficient indexing technique and an effective sequence matching technique are integrated to reduce the computation cost and to raise the retrieval accuracy, respectively. Experimental results reveal that our approach is very promising in enhancing content-based video retrieval in terms of efficiency and effectiveness.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

In recent years, advanced digital capturing technology has led to the rapid growth of digital data. Through the ease of communication tools, millions of multimedia items are exchanged on the Internet at any time. Hence, knowledge discovery from the massive amount of multimedia data, so-called multimedia mining, has been the focus of attention over the past few years. For multimedia mining, compound and complex multimedia data are usually organized into multimedia repositories by multimedia conceptualizing techniques, such as classification and annotation (Tseng, Su, Huang, & Chen, 2008; Virga & Duygulu, 2005). Behind these multimedia conceptualizing techniques, the main perspective is that researchers attempt to satisfy users' semantic demands by automatically bridging human concepts and low-level features.

Nevertheless, so far, very few studies have been successful in modeling the relationships between the complex low-level features and the diverse human concepts. For example, two similar videos annotated with different conceptual descriptions possibly result in a large gap between the user's intention and the multimedia search results. Also, video conceptualization is much more difficult than image conceptualization since videos consist of multiple multimedia contents, including image, audio and text. A video can be viewed as a set of sequential images that contains a large variety of concepts, and even though several studies have been made on annotating the image frames in a video, the concepts in the image frames still cannot represent the whole video. In addition, based on manual descriptions, most on-line search engines, such as Youtube, Google, Yahoo, MSN, etc., provide users with a textual-based multimedia search service. However, such a service cannot precisely capture the user's intent. Fig. 1 is a real example illustrating that the search results are almost all incorrect with respect to the query ''Racing Car'' that the user has in mind.

To rescue video retrieval from the plight of textual-based video retrieval, content-based video retrieval (CBVR) (Dimitrova et al., 2002; Gaughan, Smeaton, Gurrin, Lee, & Mc Donald, 2003; Rautiainen, Ojala, & Seppänen, 2004; Zhu, Elmagarmid, Xue, Wu, & Catlin, 2005) has drawn researchers' attention for a long time. Without considering the identification of query terms, users can obtain their desired videos by submitting a video clip of interest. However, the problem of the high cost of computing the visual feature similarities among videos remains unsolved. Hence, in this paper, we propose an innovative method that achieves high-quality content-based video retrieval by mining the temporal patterns in video contents. As a whole, the main concentration of the proposed method is: (1) the construction of a pattern-based index for efficient retrieval, namely the fast-pattern-index tree, and (2) a unique search strategy for effective retrieval, namely pattern-based search. In our method, without any query terms, the most relevant videos can be found in a large-scale video database through the analysis of video contents and the comparison of simplified visual patterns. Empirical evaluations show that our approach brings out better results than other methods for content-based video retrieval.


Fig. 1. Example for textual-based search results.


The remainder of this paper is structured as follows. Previous work is reviewed in Section 2. In Section 3, we explain the notion of our proposed method for content-based video retrieval in detail. Empirical evaluations are illustrated in Section 4. Finally, conclusions and future work are elaborated in Section 5.

2. Related work

Search engines have been widely used as platforms for knowledge discovery from the web over the last few decades. Yet, little attention was given to video retrieval until the popularity of digital capturing devices and communication tools. In order to facilitate textual-based video retrieval, the most natural way is manual annotation. However, manual annotation is expensive due to the massive amount of video content. To this end, a considerable number of past studies were conducted on automated semantic annotation of videos (Tseng, 2005; Tseng, Su, & Huang, 2006; Tseng et al., 2008), using techniques such as decision trees, hidden Markov models (HMM), K nearest neighbors (KNN), association mining, support vector machines (SVM), etc. Through the automated descriptions of videos, the user's interest and the videos can be associated semantically. Unfortunately, diverse concepts cause distorted descriptions and thereupon limit the effectiveness of video retrieval. Given this limitation of textual-based video retrieval, in this paper we focus our attention on content-based video retrieval. For content-based video retrieval, a video is traditionally divided into several scenes, and each scene contains some shots that consist of a few time-limited/similarity-limited image frames. Out of these sequential frames, a representative frame is defined as a key-frame. In general, based on the extracted visual features, such as color, shape and texture, the related work on content-based video retrieval can be categorized as follows.

• Key-frame-based retrieval: In the beginning of content-based video retrieval, researchers attempted to search for the desired videos by an image. However, the concept of an image can hardly represent that of a video; in other words, a video is composed of a set of sequential images and audio. For content-based video retrieval, a query video can convey richer content information to a search system than a query image. Accordingly, numerous past studies (Aoki, Shimotsuji, & Hori, 1996; Jain, Vailaya, & Wei, 1999; Kim & Park, 2002) devoted their attention to finding the relevant videos by sequentially comparing the key-frames of the query video with those of the target videos. Clearly, the computation cost is so high that users cannot put up with the long response time. Besides the computation cost, what seems to be lacking in this paradigm is consideration of the temporal order, sequence and duration of shots in a video.

• Sliding-window-based retrieval: With more consideration of the temporal continuity of shots, Adjeroh, Lee, and King (1998) and Santini and Jain (1999) proposed specialized distance functions to look for the matching videos by calculating the similarities between the shots of the query and those of the target videos. Chen and Chua (2001) and Kim and Chua (2005) made use of the longest common subsequence (LCS) matching technique to find the longest subsequence of frames common to two sequences; hence, the temporal similarity between two video clips can be derived by the LCS measure. Unfortunately, the computation cost for sequential visual feature comparisons is very high.

• Cluster-based retrieval: In traditional content-based video retrieval, one of the important factors is the similarity model. Based on the similarity of shots, Liu, Zhuang, and Pan (1999) and Wu, Zhuang, and Pan (2000) proposed a clustering-based similarity cut point to distinguish whether two shots are similar or not. In related work (Cheng & Xu, 2003), Cheng et al. presented a hierarchical clustering approach to build a shot cluster tree; by traversing this tree level by level, similar shots can be found efficiently through pruning of the search space. However, this approach still needs high-priced manual annotation, which causes a large limitation in dealing with a huge amount of data.

• Graph-based retrieval: According to the above descriptions of three kinds of content-based video retrieval methods, we can realize the importance of the temporal continuity of shots. Other researchers in this field regarded content-based video retrieval as a graph-based matching problem. On the basis of the temporal continuity of shots, Shan and Lee (1998) adopted the similarity measures optimal mapping (OM) and optimal mapping with replication (OMR) to determine the similarity between two shots. Furthermore, in Peng and Ngo (2006), maximum matching (MM) was utilized to filter the irrelevant shots and OM was utilized to rank the similarity of clips according to visual and granularity factors. The main disadvantage is that the computation complexity increases rapidly as the number of shots rises.

3. The proposed method

The main challenge in content-based video retrieval is how to utilize video contents to search for the user's videos of interest effectively and efficiently. In fact, effective and efficient retrieval primarily lies in two aspects: the index and the search strategy. In this section, we present in detail how our proposed method achieves high-quality content-based video retrieval by means of special pattern-based matching techniques.

3.1. Basic idea

Before introducing our proposed method, we have to clarify the basic idea. Generally speaking, a video is composed of a sequence of shots/key-frames. Due to the relations and co-relations of these sequential shots, the video retrieval strategy instinctively has to consider the temporal continuity of shots; that is to say, two video clips are similar if their subsequences are similar. Because of the complicated video contents, the complexity of CBVR is much higher than that of content-based image retrieval (CBIR). For example, Fig. 2 shows two shot sequences extracted from two video clips. Overall, (a) and (b) can be viewed as a pair of relevant video clips since the temporal continuities of the two sequences are almost the same in terms of visual features. In contrast to CBIR, this example delivers the critical point that effective video retrieval depends on good sequence matching. Based on this viewpoint, in this paper, we index the shared sequences in a tree structure to make video retrieval more efficient.

3.2. Overview of the proposed method

In order to take both effectiveness and efficiency into consideration, we propose a novel pattern-based indexing technique to approximate the optimal solution for enhancing content-based video retrieval. To achieve this purpose, as illustrated in Fig. 3, the whole procedure consists of the following stages.

3.2.1. Preprocessing stage
In principle, this stage mainly involves video preprocessing, which includes shot detection, feature extraction, shot clustering and shot encoding. Because our proposed method is based on the visual features of shots, this is a foundational stage for processing both the query clip and the target videos. Finally, whether for the query clip or the target videos, each shot is assigned a symbol according to the cluster it belongs to.

Fig. 2. Example of two similar video clips.

3.2.2. Indexing stage
The goal of this stage is to build two types of index-tree, namely the FPI-tree and the AFPI-tree, from the symbolized patterns of the target videos. The trees provide content-based video retrieval with good support for hunting the relevant videos efficiently; that is, without an index-tree, efficient retrieval cannot be attained.

3.2.3. Search stage
Once the index trees are ready, the primary task in this stage is how to make use of them to search for the videos most similar to the query clip. Based on the proposed index trees, we develop two pattern-based search algorithms, namely FPI-search and AFPI-search, to meet the user's need for content-based video retrieval. For FPI-search, if a great many matching videos are found, the re-rank operation is triggered to re-sort the matching videos by their visual similarities. For AFPI-search, the re-ranking operation of FPI-search can be skipped to save computation cost. However, AFPI-search and FPI-search have individual advantages for different types of video retrieval; the major difference between FPI-search and AFPI-search is clarified in the succeeding sections.

3.3. Preprocessing stage

Like traditional CBVR, this stage contains several essential operations that supply the necessary elements to the indexing and search stages. In detail, this stage includes the following operations.

3.3.1. Shot detection
In this operation, for the query clip and the target videos, we perform transitional shot detection to divide a video into a set of sequential shots. Finally, the key-frame of each shot is defined. Hence, a shot within a video clip is represented by a key-frame in the remainder of this paper.

3.3.2. Shot expurgation
In TV programs, most of the divided shots are embedded with banners or marquees. However, these banners and marquees are noise that seriously degrades the visual comparison between two frames. Thereupon, pruning the banners and marquees is necessary for enhancing the visual comparison. Fig. 4 is an example of expurgating a shot. A shot is segmented into 8 × 8 regions; the central 36 regions are kept and the others are expurgated. As a result, most banners and marquees in a shot are pruned in our preprocessing.

Fig. 3. Workflow of the proposed method.

3.3.3. Feature extraction
After the previous processes, color layout and edge histogram features are both extracted from the expurgated shots. These features are helpful to the Shot Clustering and Re-ranking operations in the preprocessing stage and the search stage, respectively.

3.3.4. Shot clustering and encoding
To construct the pattern-based index tree, encoding the shots is necessary. The main contribution of this step is that the feature dimensionality can be reduced substantially and the pattern matching cost becomes very low. In this work, the shots are clustered by the well-known k-means algorithm and each shot is assigned a symbol according to the cluster it belongs to, as shown in Fig. 5. Another important issue to address here is the quality of clustering, since it has a significant impact on the quality of pattern-based video retrieval. Thus, we adopt the following validation measures to confirm the clustering quality. As shown in Fig. 6, the whole validation procedure does not stop until all criteria are satisfied. The involved criteria, namely Local Proportion, Local Density and Global Density, are described as follows, and a small code sketch of the resulting validation loop is given after the list.

Fig. 4. Example of shot expurgation.

• Local proportion (LP): Due to the characteristic of k-means, the number of points in each cluster is different. The local proportion c_l represents the number of shots in a cluster. If many clusters contain very few shots at a clustering iteration, the clustering result is not good; in other words, a high local-proportion rate indicates reliable clustering quality. The threshold for local proportion is defined as:

LP = \frac{|Shot|}{|Cluster|} \quad (1)

where Shot denotes the set of all shots and Cluster denotes the set of all clusters. Therefore, if 20% of the c_l values cannot exceed LP, the clustering result is not considered good enough in this paper.

• Local density (LD): Local density represents the density of a cluster; in fact, it stands for the entropy of a cluster. A low local density indicates that most shots in a cluster are not very similar. Hence, if many clusters have low densities at a clustering iteration, the clustering quality is low. Based on confidence interval estimation, the local density distribution can be formulated as:

P\left( \bar{S} - z_{\alpha/2}\,\frac{\sigma_{LD}}{\sqrt{|Cluster|}} \;\le\; \mu \;\le\; \bar{S} + z_{\alpha/2}\,\frac{\sigma_{LD}}{\sqrt{|Cluster|}} \right) = 1 - \alpha \quad (2)

where S = \cup s_n, s_n denotes the average of the distances between any two shots in the nth cluster, \bar{S} is the mean of S over all clusters, and \sigma_{LD} is the standard deviation of S. The formulation indicates that a 100(1 - \alpha) percent confidence interval on the local density \mu can be derived by considering the distribution of \mu. In this paper, we want a 95% two-sided confidence interval on \mu; that is, 1 - \alpha = 0.95, so \alpha = 0.05 and z_{\alpha/2} = z_{0.025} = 1.96. The confidence interval is then constructed by using Eq. (2). Finally, if 30% of the local densities do not fall into the confidence interval, the clustering procedure repeats.

Fig. 5. The encoding process.

• Global density (GD): Global density is the density of the global clusters. In contrast to local density, a large average distance among the global clusters represents good dispersion and distinction. According to this notion, the global density threshold is defined as:

GD = \bar{D} - 0.7 \cdot \sigma_{GD} \quad (3)

where D is the set of the distances between any two clusters, \bar{D} is the average of D and \sigma_{GD} is the standard deviation of D. Hence, if 30% of the distances in D cannot exceed GD, the quality of clustering is bad.

Fig. 6. The validation procedure for clustering.

On the basis of the above, we set three thresholds to ensure that the clustering is good enough to support the indexing and search stages; that is, the clustering algorithm ends when all three thresholds are satisfied. Fig. 7 is a proper example to explicate the validation of clustering: (1) cluster 1 is better than cluster 2 because Local-Density(cluster 1) < Local-Density(cluster 2), (2) clusters 1 and 2 are more heterogeneous than clusters 3 and 4 because the distance between clusters 1 and 2 is longer than that between clusters 3 and 4, and (3) cluster 4 is not a good cluster because its number of shots is much smaller than the average LP = 15/4 ≈ 4.
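To make the validation loop concrete, the following C++ sketch checks the three criteria on a finished clustering result. It assumes the per-cluster shot counts c_l, the per-cluster average intra-cluster distances s_n and the pairwise inter-cluster distances have already been computed from the extracted visual features; the 20%/30% tolerances and z_{0.025} = 1.96 follow the text, while the function name and the exact failure conditions are our own illustrative choices, not the authors' implementation.

#include <cmath>
#include <iostream>
#include <vector>

// Returns true if the clustering passes all three validation criteria of
// Section 3.3.4; otherwise the caller should re-run k-means.
bool clusteringIsValid(const std::vector<std::size_t>& shots_per_cluster, // c_l
                       const std::vector<double>& avg_intra_dist,         // s_n per cluster
                       const std::vector<double>& inter_cluster_dist) {   // D
    const double k = static_cast<double>(shots_per_cluster.size());

    // Local proportion (Eq. 1): LP = |Shot| / |Cluster|; at most 20% of the
    // clusters may fall below LP.
    double total_shots = 0;
    for (std::size_t c : shots_per_cluster) total_shots += c;
    double lp = total_shots / k;
    double small = 0;
    for (std::size_t c : shots_per_cluster) if (c < lp) ++small;
    if (small / k > 0.20) return false;

    // Local density (Eq. 2): a 95% confidence interval on the mean of the
    // average intra-cluster distances; at most 30% of the s_n may fall outside.
    double mean = 0;
    for (double s : avg_intra_dist) mean += s;
    mean /= k;
    double var = 0;
    for (double s : avg_intra_dist) var += (s - mean) * (s - mean);
    double half = 1.96 * std::sqrt(var / k) / std::sqrt(k);   // z_{0.025} = 1.96
    double outside = 0;
    for (double s : avg_intra_dist)
        if (s < mean - half || s > mean + half) ++outside;
    if (outside / k > 0.30) return false;

    // Global density (Eq. 3): GD = mean(D) - 0.7 * sigma(D); at most 30% of
    // the inter-cluster distances may fail to exceed GD.
    double dmean = 0;
    for (double d : inter_cluster_dist) dmean += d;
    dmean /= inter_cluster_dist.size();
    double dvar = 0;
    for (double d : inter_cluster_dist) dvar += (d - dmean) * (d - dmean);
    double gd = dmean - 0.7 * std::sqrt(dvar / inter_cluster_dist.size());
    double below = 0;
    for (double d : inter_cluster_dist) if (d <= gd) ++below;
    return below / inter_cluster_dist.size() <= 0.30;
}

int main() {
    // Toy numbers in the spirit of Fig. 7: four clusters, 15 shots in total.
    std::vector<std::size_t> counts = {6, 5, 3, 1};
    std::vector<double> intra = {0.8, 1.1, 0.9, 1.0};
    std::vector<double> inter = {4.0, 3.5, 3.8, 1.2, 3.9, 3.6};
    std::cout << (clusteringIsValid(counts, intra, inter) ? "valid" : "re-cluster")
              << '\n';
    return 0;
}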

3.4. Indexing stage

After the video clips in the database are symbolized, the proposed FPI-tree or AFPI-tree is built in this stage to enhance content-based video retrieval. Table 1 is a simple example of a clip-transaction list that contains 4 target clips; each clip consists of several sequential shot-patterns. From this clip-transaction list, we can build the proposed index-trees, namely the FPI-tree and the AFPI-tree. In comparison with earlier studies, the major contribution of the proposed index-tree is that the matching complexity and cost of video retrieval can be reduced by traversing the index-tree.

Fig. 7. Example of clustering quality.

Page 6: 2010 Effective Content-based Video Retrieval Using Pattern-Indexing

Table 1
Example of clip-transaction list.

Clip-id   Shot/Key-Frame
Clip 1    A, B, C, A
Clip 2    C, B, B, A, E, F
Clip 3    F, F, E, E, A, B, D, B, C, A, B
Clip 4    B, C, G, C, A, D, B


Basically, the task of building the index-tree can be divided into two parts: the generation of temporal patterns and the construction of the index-tree.

3.4.1. Generation of temporal patterns
To consider both the temporal continuity and the duration of shots, a subsequence has to slide along the shot sequence, a so-called sliding window. For example, assume that a query clip contains a set of patterns {B, A}. From Table 1, all 4 clips contain {B, A}, but the 4th clip is the most dissimilar one to the query since the duration of {B→A} in Clip 4 is too large to represent good temporal continuity. Accordingly, the window size, which is particularly noteworthy in this work, can be divided into two types: static and dynamic window sizes. Like the traditional setting of a sliding window, the static window size has to be initialized before building the index-tree. Note that the window size (defined as winsize) in this paper indicates the next winsize neighbors from the starting pattern. The procedure of generating two shot-patterns is shown in Fig. 8. For example, consider Clip 1 in Table 1: if winsize is 3, the next 3 patterns {B, C, A} from A are contained in the window {A, B, C, A}. The main advantage of the static window size is that the index-tree can be built off-line, so the tree construction cost is saved while performing the pattern matching operation. In contrast, the dynamic window size is an adaptive method that adjusts the window size according to the length of the query clip. That is, if the query clip is shorter than the initialized window size, the dynamic window size is more reasonable than the static window size. In addition to its adaptivity, another advantage of the dynamic window size is that, even if the database is updated frequently, the newly added videos can still be found. The dynamic window size is defined as:

Dwinsize = \left\lceil \frac{|Query|}{Ratio} \right\rceil \quad (4)

where Query is the set of shot-patterns in the query clip and Ratio is the expected ratio to the length of the query clip. For example, if a query contains 5 shots and the expected ratio is 2, Dwinsize is \lceil 5/2 \rceil = 3.

Fig. 8. Procedure of pattern generation.

In addition to window size, another important point to clarify is the length of the needed patterns. In this paper, two shot-patterns are enough to solve the sequence matching problem; moreover, this reduces the size of the index-tree and further saves the search cost. Table 2 illustrates that any multiple shot-pattern can be implied by two shot-patterns; that is, generating patterns longer than two is redundant. Note that, as shown in Table 3, duplicate patterns within a video clip have to be pruned for the FPI-tree, since duplicate patterns are recognized as redundant. In contrast, all duplicate patterns are preserved to construct the AFPI-tree.
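To make the pattern generation concrete, the following C++ sketch enumerates the two shot-patterns inside every sliding window of a symbolized clip and also computes the dynamic window size of Eq. (4). It assumes each shot has already been encoded as a single character in the preprocessing stage; the duplicate-pruning switch mimics the FPI/AFPI distinction of Table 3, and all function and variable names are illustrative only.

#include <cmath>
#include <iostream>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Generate all two shot-patterns (a -> b) where b lies within the next
// `winsize` shots after a. If `prune_duplicates` is true (FPI-tree), each
// distinct pattern is emitted only once per clip; otherwise (AFPI-tree)
// every occurrence is kept so that frequencies can be counted later.
std::vector<std::pair<char, char>> twoShotPatterns(const std::string& clip,
                                                   std::size_t winsize,
                                                   bool prune_duplicates) {
    std::vector<std::pair<char, char>> patterns;
    std::set<std::pair<char, char>> seen;
    for (std::size_t i = 0; i < clip.size(); ++i)
        for (std::size_t j = i + 1; j <= i + winsize && j < clip.size(); ++j) {
            std::pair<char, char> p(clip[i], clip[j]);
            if (prune_duplicates && !seen.insert(p).second) continue;
            patterns.push_back(p);
        }
    return patterns;
}

// Dynamic window size of Eq. (4): the ceiling of |Query| / Ratio.
std::size_t dynamicWinSize(std::size_t query_len, double ratio) {
    return static_cast<std::size_t>(std::ceil(query_len / ratio));
}

int main() {
    // Clip 2 of Table 1: C, B, B, A, E, F with winsize = 4 (cf. Table 3).
    std::string clip2 = "CBBAEF";
    for (const auto& p : twoShotPatterns(clip2, 4, /*prune_duplicates=*/true))
        std::cout << p.first << " -> " << p.second << '\n';
    std::cout << "Dwinsize for a 5-shot query with Ratio 2: "
              << dynamicWinSize(5, 2.0) << '\n';   // prints 3, as in the text
    return 0;
}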

3.4.2. Construction of index-tree
After generating the two shot-patterns within a sliding window, two types of index-tree are then constructed to serve the video search, namely the FPI-tree and the AFPI-tree. The detailed description of building the index-trees is given in the following subsections.

3.4.2.1. Building fast-pattern-index tree. More generally, the FPI-tree can be regarded as a 2-pattern-based prefix-tree and its construction can be viewed as an iterative operation. For each clip in the database, we have to generate all two shot-patterns, each represented as a ''2-pattern'' in the following. If a 2-pattern is shared by multiple clips, the related clip ids form the queue prefixed by that specific 2-pattern. Let us take an example of constructing the FPI-tree. Assume that Ratio is 2 and thus Dwinsize is 3. Based on Table 1, the FPI-tree is shown in Fig. 9. As stated above, duplicate clip ids do not appear in a prefixed queue of this tree; thus, the building and search cost can be reduced significantly.
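A minimal sketch of this construction, using a plain prefix map from each 2-pattern to its queue of clip ids instead of an actual tree, is given below. It mirrors the prefixed queues illustrated in Fig. 9 for the clips of Table 1, but it is only an illustration of the idea under our own naming, not the authors' data structure.

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// A simplified FPI index: each 2-pattern (a -> b) maps to the queue of clip
// ids that contain the pattern inside some sliding window. Clip ids are
// inserted at most once per pattern, mirroring the duplicate pruning of the
// FPI-tree.
using Pattern = std::pair<char, char>;
using FPIIndex = std::map<Pattern, std::vector<int>>;

FPIIndex buildFPI(const std::vector<std::string>& clips, std::size_t winsize) {
    FPIIndex index;
    for (std::size_t id = 0; id < clips.size(); ++id) {
        std::set<Pattern> seen;                       // prune duplicates per clip
        const std::string& clip = clips[id];
        for (std::size_t i = 0; i < clip.size(); ++i)
            for (std::size_t j = i + 1; j <= i + winsize && j < clip.size(); ++j) {
                Pattern p(clip[i], clip[j]);
                if (seen.insert(p).second)
                    index[p].push_back(static_cast<int>(id) + 1);  // clip ids start at 1
            }
    }
    return index;
}

int main() {
    // The four clips of Table 1, with winsize = 3 (Ratio = 2 on a 5-shot query).
    std::vector<std::string> clips = {"ABCA", "CBBAEF", "FFEEABDBCAB", "BCGCADB"};
    FPIIndex index = buildFPI(clips, 3);
    for (const auto& entry : index) {
        std::cout << entry.first.first << " -> " << entry.first.second << " :";
        for (int id : entry.second) std::cout << ' ' << id;
        std::cout << '\n';
    }
    return 0;
}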

3.4.2.2. Building advanced fast-pattern-index tree. In fact, the FPI-tree is developed to speed up the video search without considering duplicate patterns. However, its simple data structure leads to an additional visual re-ranking cost for the remaining video search; the detailed explanation is given in Section 3.5. The problem of additional visual re-ranking cost motivated us to modify the FPI-tree by considering the occurrences and co-occurrences of the patterns. Briefly, the goal of the AFPI-tree is to elevate the efficiency and effectiveness of video retrieval by storing more pattern information. In detail, the AFPI-tree construction can be decomposed into the following four steps.



Table 2
Example of multiple shot-patterns in Clip 1 (A, B, C, A) with the sliding window size set as 3.

Starting shot   Two shot-patterns     Multiple shot-patterns
A               A→B, A→C, A→A         A→B, A→C, A→A, A→B→C, A→B→A, A→C→A, A→B→C→A
B               B→C, B→A              B→C, B→A, B→C→A
C               C→A                   C→A

Table 3
Example of the two shot-patterns in Clip 2 (C, B, B, A, E, F) with the sliding window size set as 4.

Starting shot   Two shot-patterns for FPI     Two shot-patterns for AFPI
C               C→B, C→A, C→E                 C→B, C→B, C→A, C→E
B               B→B, B→A, B→E, B→F            B→B, B→A, B→E, B→F
B               –                             B→A, B→E, B→F
A               A→E, A→F                      A→E, A→F
E               E→F                           E→F


Step 1. Calculate the total number of the 2-patterns. In comparison with the FPI-tree, the AFPI-tree needs the total number of 2-patterns occurring in all sliding windows of a clip. Suppose that the length of a clip j is defined as clth_j. The total number of the 2-patterns is defined as:

Npattern_j = \left[ winsize \times (clth_j - winsize) \right] + \sum_{i=1}^{winsize-1} i \quad (5)

For example, if winsize is 3, Npattern for Clips 1, 2, 3 and 4 is 6, 12, 27 and 15, respectively.

Step 2. Calculate the frequency of each 2-pattern, window by window. Let the frequency of each 2-pattern be defined as tf. For example, considering Table 1, the tf of the 2-pattern set {{A, B}, {A, C}, {A, A}} in Clip 1 is {1, 1, 1}.

Step 3. Normalize the frequency of each 2-pattern. The normalized frequency of the ith pattern in the jth clip is:

Ntf_i^j = \frac{tf_i^j}{Npattern_j} \quad (6)

For example, the Ntf of the 2-pattern set {{A, B}, {A, C}, {A, A}, {B, C}, {B, A}, {C, A}} in Clip 1 is {1/6, 1/6, 1/6, 1/6, 1/6, 1/6}.

Fig. 9. Example of FPI-tree.

Step 4. For each 2-pattern, insert the related clip id (called clip_id) and Ntf into the prefixed 2-pattern queue of the AFPI-tree. According to Table 1, Fig. 10 is an illustrative example of the AFPI-tree when winsize is 3. A node in a prefixed 2-pattern queue is defined as (clip_id, Ntf_{pattern_id}^{clip_id}).
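The per-clip bookkeeping of Steps 1–3 can be summarized in a few lines of C++. The sketch below computes Npattern of Eq. (5) and the normalized frequencies Ntf of Eq. (6) for one clip; applied to Clip 1 of Table 1 it yields 1/6 for every 2-pattern, as stated above. It is illustrative code under our own naming, not the paper's implementation.

#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Pattern = std::pair<char, char>;

// Npattern of Eq. (5): the total number of 2-patterns generated from a clip
// of length clth under a given winsize (duplicates included).
std::size_t nPattern(std::size_t clth, std::size_t winsize) {
    std::size_t tail = 0;
    for (std::size_t i = 1; i < winsize; ++i) tail += i;
    return winsize * (clth - winsize) + tail;
}

// Normalized frequencies Ntf of Eq. (6): raw 2-pattern counts divided by Npattern.
std::map<Pattern, double> normalizedTf(const std::string& clip, std::size_t winsize) {
    std::map<Pattern, std::size_t> tf;
    for (std::size_t i = 0; i < clip.size(); ++i)
        for (std::size_t j = i + 1; j <= i + winsize && j < clip.size(); ++j)
            ++tf[Pattern(clip[i], clip[j])];           // keep duplicates (AFPI)
    std::map<Pattern, double> ntf;
    double npat = static_cast<double>(nPattern(clip.size(), winsize));
    for (const auto& entry : tf) ntf[entry.first] = entry.second / npat;
    return ntf;
}

int main() {
    // Clip 1 of Table 1: A, B, C, A with winsize = 3; every Ntf should be 1/6.
    for (const auto& entry : normalizedTf("ABCA", 3))
        std::cout << entry.first.first << " -> " << entry.first.second
                  << " : " << entry.second << '\n';
    return 0;
}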

3.5. Search stage

The goal of our proposed search is to find the user's videos of interest by discovering the relational patterns between the user's query and the targets. To meet different needs of video search, the proposed approaches are classified into two types: FPI-search and AFPI-search. From the perspective of effectiveness, FPI-search can eliminate lots of irrelevant videos and precisely find the desired videos by the pattern matching and visual re-ranking strategies, respectively. However, its cost is more prohibitive than that of AFPI-search. By contrast, AFPI-search is a modification of FPI-search that speeds up the search procedure; without the visual re-ranking, the cost is comparatively low and the performance is comparatively high. The details are described as follows.

3.5.1. FPI-Search
In this section, we proceed to describe how to find the videos most relevant to the query clip using FPI-search. As mentioned above, the involved work is decomposed into two steps: (1) search for the matching patterns and (2) re-rank the search results.

3.5.1.1. Search of matching patterns. The whole search procedure starts with the query clip being processed by shot detection, shot expurgation, feature extraction, cluster determination and shot encoding in the preprocessing stage. As a result, each shot of the query clip is assigned a symbol by determining its nearest cluster. Afterward, if the sliding window size is dynamic, the FPI-tree is built on-line; otherwise, the FPI-tree, which is built off-line, has to be loaded from the database. Finally, the proposed pattern matching method is performed to find the most relevant videos.

Whether the FPI-tree is built on-line or not, the major concern of this operation is to look for the patterns matching the query, window by window. As shown in Fig. 11, the search strategy can be viewed as a 2-pattern-based prefix-search, also called depth-first search. From line 6 to line 13 of Fig. 11, for each sliding window of the query clip, if the target clips occur in the queue prefixed by the 2-patterns of a sliding window, the counts of the relevant clips are accumulated. At last, the counting table CountTable for the target clips is derived. As a matter of fact, the traversing and counting operations can be finished very quickly.



Fig. 11. Algorithm FPI_PatternSearch.

Fig. 10. Example of AFPI-tree.

Table 4
Example of the shot-patterns in the query sequence, under winsize = 3.

Query sequence    Two shot-patterns
B, C, A, D, A     B→C, B→A, B→D; C→A, C→D; A→D, A→A; D→A


For example, consider a query clip Q = {B, C, A, D, A} and assume that winsize is 3. Table 4 shows all the 2-patterns generated under window size 3, which corresponds to line 5 of Fig. 11. Next, we can get further results by traversing the FPI-tree, as shown in Table 5. From Table 6, the matching problem can be converted into the problem of generating frequent 1-itemsets: a 2-pattern can be viewed as a transaction and a clip stands for an item. In consequence, the frequency of each relevant clip can be obtained. In this operation, a high frequency of a clip represents its high matching ratio under the 2-patterns within the sliding windows. However, to avoid too many returned results, a criterion for selecting candidate clips is necessary in this work. It can be defined as:


Table 5
Example of the results after traversing the FPI-tree.

Transaction_id   Two shot-pattern   Relevant clips
1                B→C                1, 3, 4
2                B→A                1, 2, 3
3                B→D                3
4                C→A                1, 2, 3, 4
5                C→D                4
6                A→D                3, 4
7                A→A                1
8                D→A                3

Table 6
Example of the counting table CountTable.

Clip_id   Count
Clip 3    6
Clip 1    4
Clip 4    4
Clip 2    2

Fig. 12. Algorithm ReRank.


supp = \frac{clip\_count}{|transaction|} \quad (7)

where clip_count indicates the frequency of a clip in a transaction list like Table 5 and transaction indicates the set of two shot-patterns within the query clip. As shown in line 13 of Fig. 11, if a clip cannot exceed the preset threshold thold, it is not good enough to be a candidate result. For example, assume that the preset threshold thold is 30%. As referred to Fig. 9, Clip 2 will be deleted since its supp (2/8 = 25%) cannot exceed thold. At last, the trimmed counting table is returned.
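Putting the pattern matching and the supp filter of Eq. (7) together, a simplified version of FPI_PatternSearch can be sketched as follows. It reuses the flat 2-pattern index of Section 3.4.2.1 rather than a real prefix tree; with the clips of Table 1, the query {B, C, A, D, A}, winsize = 3 and thold = 30% it reproduces Tables 5 and 6 and drops Clip 2. All identifiers are ours.

#include <algorithm>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Pattern = std::pair<char, char>;
using FPIIndex = std::map<Pattern, std::vector<int>>;   // 2-pattern -> clip ids

// Count, for every target clip, how many of the query's distinct 2-patterns
// (its "transactions") it shares, then drop clips whose support
// supp = clip_count / |transactions| does not exceed the threshold.
std::vector<std::pair<int, std::size_t>> fpiPatternSearch(const std::string& query,
                                                          std::size_t winsize,
                                                          const FPIIndex& index,
                                                          double thold) {
    std::set<Pattern> transactions;
    for (std::size_t i = 0; i < query.size(); ++i)
        for (std::size_t j = i + 1; j <= i + winsize && j < query.size(); ++j)
            transactions.insert(Pattern(query[i], query[j]));

    std::map<int, std::size_t> count;                    // CountTable
    for (const Pattern& p : transactions) {
        auto it = index.find(p);
        if (it == index.end()) continue;
        for (int clip : it->second) ++count[clip];
    }

    std::vector<std::pair<int, std::size_t>> result;
    for (const auto& entry : count) {
        double supp = static_cast<double>(entry.second) / transactions.size();
        if (supp > thold) result.push_back(entry);       // keep candidates only
    }
    std::sort(result.begin(), result.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return result;
}

int main() {
    std::vector<std::string> clips = {"ABCA", "CBBAEF", "FFEEABDBCAB", "BCGCADB"};
    FPIIndex index;
    for (std::size_t id = 0; id < clips.size(); ++id) {
        std::set<Pattern> seen;
        for (std::size_t i = 0; i < clips[id].size(); ++i)
            for (std::size_t j = i + 1; j <= i + 3 && j < clips[id].size(); ++j)
                if (seen.insert(Pattern(clips[id][i], clips[id][j])).second)
                    index[Pattern(clips[id][i], clips[id][j])].push_back(static_cast<int>(id) + 1);
    }
    // Query of Table 4 with winsize = 3 and thold = 30%: Clip 2 (supp 25%) is dropped.
    for (const auto& r : fpiPatternSearch("BCADA", 3, index, 0.30))
        std::cout << "Clip " << r.first << " : " << r.second << '\n';
    return 0;
}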

3.5.1.2. Re-rank of search results. According to the counting table generated, the clips should be sorted by their counts. Unfortunately, in real applications, so many relevant clips share the same count that users are confused in making a choice among the results. This problem motivates us to find a solution to re-rank the search results generated by Algorithm FPI_PatternSearch. Since 2-patterns cannot distinguish very similar video clips effectively, we extend the matching patterns to more than two patterns, as shown in Figs. 12 and 13. However, to reduce the comparison cost in this work, only the sequential patterns are considered, as shown in line 6 of Fig. 13.



Fig. 14. Example of computing the visual similarities between the query and the candidate video clips.

Fig. 13. Procedure Count().


In the worst case, even if the matching patterns are extended to the whole query sequence, there may still exist some clips with the same count. The last resort for distinguishing such similar results is to compute the visual similarities of the same patterns existing in both the query and the candidate video clips, as shown in lines 16–20 of Fig. 12. The major notion behind this scenario is that, although the patterns are the same, the low-level features may not be; as a result, the visual similarity is also different.

Continuing the above example, we now describe the re-ranking operation. From Table 6, Clips 1 and 4 have the same count and are assigned the same rank, so we have to re-rank them. First, we find three sequential 3-patterns, {B→C→A}, {C→A→D} and {A→D→A}, in the query clip. Then we perform Procedure Count() to derive the matching results and thus generate the new counting table CntTable, as shown in Tables 7 and 8. Unfortunately, the tie still cannot be broken, and we cannot find any sequential 4-patterns or 5-patterns of the query clip that also exist in Clips 1 and 4. At last, Fig. 14 shows that Clip 4 is more similar to the query than Clip 1 after calculating the visual similarities. Therefore, the new ranking list is {Clips 3, 4, and 1}.

Table 7
Example of the results after 3-pattern matching.

Transaction_id   Three shot-pattern   Relevant clip
1                B→C→A                1
2                C→A→D                4
3                A→D→A                null

Table 8
Example of the counting table CntTable under 3-pattern matching.

Clip_id   Count
Clip 1    1
Clip 4    1
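The tie-breaking step can be illustrated with the short C++ sketch below, which counts how many sequential k-patterns of the query each tied candidate contains, under the assumption that a k-pattern matches when it occurs as a contiguous run of shot symbols in the candidate (the visual-similarity fallback of Fig. 12 is omitted). On the example above it reproduces Tables 7 and 8 for k = 3 and finds no matches for k = 4 and 5; names and the matching assumption are ours.

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Break ties among candidate clips that share the same 2-pattern count by
// counting how many sequential k-patterns of the query (k = 3, 4, ...) each
// candidate contains as a contiguous run of shot symbols.
std::map<int, std::size_t> countSequentialPatterns(
        const std::string& query, std::size_t k,
        const std::map<int, std::string>& tied_clips) {
    std::map<int, std::size_t> cnt;
    for (const auto& clip : tied_clips) cnt[clip.first] = 0;
    if (query.size() < k) return cnt;
    for (std::size_t i = 0; i + k <= query.size(); ++i) {
        std::string kpattern = query.substr(i, k);
        for (const auto& clip : tied_clips)
            if (clip.second.find(kpattern) != std::string::npos)
                ++cnt[clip.first];
    }
    return cnt;
}

int main() {
    // Clips 1 and 4 of Table 1 are tied with count 4 for the query B, C, A, D, A.
    std::map<int, std::string> tied = {{1, "ABCA"}, {4, "BCGCADB"}};
    for (std::size_t k = 3; k <= 5; ++k) {
        std::cout << "k = " << k << ':';
        for (const auto& e : countSequentialPatterns("BCADA", k, tied))
            std::cout << "  Clip " << e.first << " -> " << e.second;
        std::cout << '\n';
    }
    // If the tie survives every k, the paper falls back to comparing the
    // visual similarity of the matched shots (color layout / edge histogram).
    return 0;
}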

3.5.2. AFPI-search
To improve the retrieval cost of FPI-search, AFPI-search aims at saving the re-ranking cost by using the AFPI-tree. Owing to the properties of the AFPI-tree, the major considerations of AFPI-search are the pattern frequency and the pattern weight. The pattern frequency, namely Ntf, has been discussed in Section 3.4.2. Now we explain how to calculate the pattern weight. Assume that DB is the set of the target clips, DB = {d_1, d_2, ..., d_m}, and PQ_i is the queue prefixed by the ith pattern of some clips. Consider the set of the nodes in PQ_i to be {d_x, ..., d_y}, 1 ≤ x ≤ y ≤ m. The ith pattern weight referred to PQ_i is defined as:

Idf_i = \log\left( \frac{|DB|}{|PQ_i|} \right) \quad (8)

Behind the notion of the pattern frequency, the pattern frequency represents the matching rate over the targets; thus, the higher the pattern frequency, the more important the pattern. Compared with the pattern frequency, the pattern weight represents the distinctiveness of the pattern: if a pattern occurs in more video clips, its distinctiveness is relatively low, and of course its weight is low. Fig. 15 shows the proposed search algorithm, namely AFPI_PatternSearch. The primary difference between FPI-search and AFPI-search lies in lines 9–15 of Fig. 15. The degree contributed to a clip by a matched node is defined as degree = Ntf_i^{cnode} \cdot Idf_i. Finally, the degree of each clip is stored into the DegreeTable and the clips are sorted by the related degrees.

For example, consider a query sequence {B, C, A, D, A} with Dwinsize = 3. After traversing the AFPI-tree, the related 2-pattern set is shown in Table 9. Based on Table 9, for each clip, the related degree is calculated by performing lines 23–29 of Fig. 15. Similar to FPI-search, this can be viewed as frequent pattern mining, where a 2-pattern stands for a transaction id, a clip stands for an item and Ntf stands for support. The first step is to accumulate the degree (Idf_{pattern\_id} \cdot Ntf_{pattern\_id}^{clip\_id}) of each clip. For example, the degree of Clip 1 is Ntf_1^1 Idf_1 + Ntf_2^1 Idf_2 + Ntf_4^1 Idf_4 + Ntf_6^1 Idf_6 + Ntf_8^1 Idf_8 = {log(4/3) × (1/6)} + {log(4/3) × (1/6)} + {log(4/4) × (1/6)} + {log(4/4) × (1/6)} + {log(4/1) × (1/6)} = {0.1249 × 0.1667} + {0.1249 × 0.1667} + 0 + 0 + {0.602 × 0.1667} = 0.0208 + 0.0208 + 0 + 0 + 0.1004 = 0.1420. Accordingly, the derived DegreeTable is as shown in Table 10.


Fig. 15. Algorithm AFPI_PatternSearch.

Table 9
Example of the results after traversing the AFPI-tree.

Pattern_id   Two shot-pattern   Idf_pattern_id   Relevant clips (clip_id, Ntf)
1            B→C                log(4/3)         (1, 1/6), (3, 2/27), (4, 2/15)
2            B→A                log(4/3)         (1, 1/6), (2, 2/12), (3, 1/27)
3            B→D                log(4/1)         (3, 1/27)
4            C→A                log(4/4)         (1, 1/6), (2, 1/12), (3, 1/27), (4, 2/15)
5            C→D                log(4/1)         (4, 1/15)
6            C→A                log(4/4)         (1, 1/6), (2, 1/12), (3, 1/27), (4, 2/15)
7            A→D                log(4/2)         (3, 1/27), (4, 1/15)
8            A→A                log(4/1)         (1, 1/6)
9            D→A                log(4/1)         (3, 1/27)

Table 10
Example of the degree table DegreeTable.

Clip_id   Accumulated (Ntf * Idf)                                                                                              Degree
Clip 1    {log(4/3)*(1/6)} + {log(4/3)*(1/6)} + {log(4/4)*(1/6)} + {log(4/4)*(1/6)} + {log(4/1)*(1/6)}                         0.1420
Clip 4    {log(4/3)*(2/15)} + {log(4/4)*(2/15)} + {log(4/1)*(1/15)} + {log(4/4)*(2/15)} + {log(4/2)*(1/15)}                    0.0769
Clip 3    {log(4/3)*(2/27)} + {log(4/3)*(1/27)} + {log(4/1)*(1/27)} + {log(4/4)*(1/27)} + {log(4/4)*(1/27)} + {log(4/2)*(1/27)} + {log(4/1)*(1/27)}   0.0696
Clip 2    {log(4/3)*(2/12)} + {log(4/4)*(1/12)} + {log(4/4)*(1/12)}                                                            0.0208


Compared with FPI-search, the ranking list of FPI, {Clips 3, 4 and 1}, is different from that of the AFPI-tree, {Clips 1, 4, 3 and 2}. From this example we can observe the following. First, {C→A} is an unimportant 2-pattern since it occurs in every clip; hence, its weight Idf is log(4/4) = 0. Second, although Clip 3 is the longest clip in the database, it is assigned a low rank on the ranking list because its normalized 2-pattern frequencies are too low and the related pattern weights are also low. Here we have to clarify the importance of frequency normalization: un-normalized frequencies would mean that the longer the clip, the larger the pattern frequency and the higher the clip rank. Third, Clip 2 is the most dissimilar one since its pattern composition is far from that of the query.
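For completeness, the degree computation of AFPI-search is sketched below in C++: every 2-pattern occurrence of the query contributes Ntf × Idf to each clip in that pattern's prefixed queue, with Idf = log(|DB| / |PQ_i|) of Eq. (8), taken base 10 to match the numbers in Table 9. Built from the clips of Table 1 with winsize = 3, the sketch reproduces the DegreeTable of Table 10; it is an illustration under our own naming, not the authors' code.

#include <cmath>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Pattern = std::pair<char, char>;
// Simplified AFPI queue: for each 2-pattern, the clips containing it and their Ntf.
using AFPIIndex = std::map<Pattern, std::map<int, double>>;

int main() {
    std::vector<std::string> clips = {"ABCA", "CBBAEF", "FFEEABDBCAB", "BCGCADB"};
    std::size_t winsize = 3;

    // Build the simplified AFPI index: Ntf = tf / Npattern per clip (Eq. 6).
    AFPIIndex index;
    for (std::size_t id = 0; id < clips.size(); ++id) {
        const std::string& c = clips[id];
        double npattern = winsize * (c.size() - winsize) + winsize * (winsize - 1) / 2.0;
        for (std::size_t i = 0; i < c.size(); ++i)
            for (std::size_t j = i + 1; j <= i + winsize && j < c.size(); ++j)
                index[Pattern(c[i], c[j])][static_cast<int>(id) + 1] += 1.0 / npattern;
    }

    // Accumulate degree = sum of Ntf * Idf over every 2-pattern occurrence of
    // the query (duplicates kept), with Idf = log10(|DB| / |PQ_i|).
    std::string query = "BCADA";
    std::map<int, double> degree;                        // DegreeTable
    for (std::size_t i = 0; i < query.size(); ++i)
        for (std::size_t j = i + 1; j <= i + winsize && j < query.size(); ++j) {
            auto it = index.find(Pattern(query[i], query[j]));
            if (it == index.end()) continue;
            double idf = std::log10(clips.size() / static_cast<double>(it->second.size()));
            for (const auto& node : it->second)
                degree[node.first] += node.second * idf;
        }

    for (const auto& d : degree)                         // matches the values of Table 10
        std::cout << "Clip " << d.first << " : " << d.second << '\n';
    return 0;
}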

4. Experimental evaluations

In the previous section, the three main stages have been presented clearly. In this section, we verify the notion of our proposed method through complete experiments. The experiments were implemented in C++ on a Pentium-4 3.0 GHz personal computer with 1 GB RAM running Windows XP.

4.1. Experimental data

The experimental data consists of a collection of 15 real-life video categories. Table 11 shows the composition of the experimental data. In this video collection, 471 diverse video clips were selected as the experimental data sets. The total duration of the experimental data is around 1550 min and the data size is about 85 GB. Moreover, 12,727 shots were split from the experimental data. According to Table 11, the experimental data can be divided into three types. The main concern is to assess the ability of our proposed method on different kinds of videos. Hence, from Data 1 and 2, we selected half of the videos as the query clips. For Data 1, successful retrieval is based on the assumption that, if the query clip is a commercial clip about Coca-Cola, the results have to be related to Coca-Cola. Similarly, successful retrieval for Data 2 is based on the assumption that the returned news videos have to be related to the subject of the query clip. For Data 3, we randomly selected 33% of the video clips from each category as the query set; successful retrieval is defined as the category of the resulting set being the same as that of the query clip.


Table 11
The characteristics of experimental data.

Type     Category                        #Clip            #Shot   #Avg. shot   Duration
Data 1   Commercial                      80 (16 kinds)    657     8.21         0:28:52
Data 2   News-event                      133 (24 kinds)   1606    12.08        3:53:04
Data 3   Cartoon                         17               10464   76.06        1:30:12
         Sea-world                       18                       12.89        0:36:00
         Plane                           19                       16.47        0:40:59
         Military                        20                       18.75        1:34:07
         Sports (indoor): Basketball     21                       42.90        2:06:51
         Sports (indoor): Billiards      18                       43.33        2:08:20
         Sports (indoor): Volleyball     20                       51.35        1:35:02
         Sports (indoor): Tennis         21                       42.05        1:36:02
         Sports (outdoor): Baseball      15                       43.05        1:40:05
         Sports (outdoor): Surfing       18                       18.89        0:36:04
         Sports (outdoor): Soccer        24                       56.63        3:26:02
         Sports (outdoor): Motor-racing  24                       55.21        1:44:47
         Sports (outdoor): F1-racing     23                       31.83        1:19:27
Total                                    471              12727                24:55:43

Table 12
The optimal #cluster of experimental data.

Dataset   Involved #shot   Optimal #cluster
Data 1    657              25
Data 2    1606             30
Data 3    10464            40


The evaluation was investigated in terms of two main measures, namely Precision and Recall. Precision represents the ratio of the cardinality of correctly returned video clips to the cardinality of the resulting video clips. Recall indicates the ratio of the cardinality of correctly returned video clips to the cardinality of the relevant video clips. The two measures are defined as follows:

precision = \frac{|Correct|}{|Returned|} \times 100\%, \qquad recall = \frac{|Correct|}{|Relevant|} \times 100\%

where Correct is the correct retrieval set, Returned is the resulting video set and Relevant is the ground truth. For example, if for 10 returned videos the category of five video clips is the same as that of the query clip, then Precision is 5/10 × 100% = 50%. If there are 20 relevant videos of the same category in the database, Recall is 5/20 × 100% = 25%.

Table 13
The optimal parameter settings.

Dataset   Optimal Ratio for FPI   Optimal WinSize for FPI   Optimal Ratio for AFPI   Optimal WinSize for AFPI
Data 1    1.6                     10                        2.7                      2
Data 2    3.5                     4                         3.2                      2
Data 3    3.5                     8                         3.9                      2

Fig. 16. The precisions and recalls of FPI for Data 1, in terms of dynamic and static window sizes.

4.2. Experimental results

Basically, the evaluations were conducted for the following purposes: (1) the evaluation of our proposed methods under different parameter settings, (2) the comparison of the proposed approaches and the others in terms of effectiveness and efficiency, and (3) the evaluation of the retrieval quality for different categories. The six methods involved are FPI-dynamic, FPI-static, AFPI-dynamic, AFPI-static, BSW and ASW. FPI-dynamic and AFPI-dynamic are our proposed methods based on the dynamic window size setting, and their execution time includes tree construction time and search time. FPI-static and AFPI-static are also our proposed methods, based on the static window size setting, and their execution time only includes search time. The major difference between dynamic search and static search is that dynamic search is more adaptive than static search in terms of the sliding window size. BSW indicates ''Basic-Sliding-Window-based video retrieval''; this method is a basic sequence matching technique that calculates the visual similarities between the sliding windows of the query clip and those of the target video clips. ASW indicates ''Advanced-Sliding-Window-based video retrieval''; this is an LCS-like sequence matching technique referred to (Chen & Chua, 2001). The detailed experimental results are given in the succeeding subsections.

4.2.1. Evaluations of our proposed approach under different parameter settings
To elicit the optimal search results, some of the involved parameter settings have to be examined. The first critical parameter is the number of clusters. In this experiment, what we can conclude from the observation of the related results is that the more shots there are, the more clusters are needed and the higher the retrieval quality. As a result, the best number of clusters for the remaining experiments is set as in Table 12. After experiments with different window sizes, the best settings of the sliding window size are summarized in Table 13.

The second crucial factor we concentrate on is the sliding window. As far as the idea of the dynamic window size is concerned, it is adjustable for different durations of clips. However, the results obtained were contrary to our intention. In this experiment, we adopted the top 1, 3, 5 and 10 results to measure the precision and recall of FPI and AFPI. Figs. 16–21 reveal that the results using the dynamic window size are pretty close to those using the static window size. Overall, the dynamic window size setting is slightly worse than the static window size setting. The proper explanation is that a short window size can bring out better results; in detail, a long window size can generate too many patterns to distinguish the relevant videos, and unfortunately the dynamic window size setting hits exactly this point. Furthermore, too many generated patterns increase the execution time. Table 14 exactly shows this perspective: static-based search performs faster than dynamic-based search, and AFPI performs faster than FPI.


Fig. 18. The precisions and recalls of FPI for Data 3, in terms of dynamic and static window sizes.

Fig. 17. The precisions and recalls of FPI for Data 2, in terms of dynamic and static window sizes.

Fig. 19. The precisions and recalls of AFPI for Data 1, in terms of dynamic and static window sizes.

Fig. 20. The precisions and recalls of AFPI for Data 2, in terms of dynamic and static window sizes.

Fig. 21. The precisions and recalls of AFPI for Data 3, in terms of dynamic and static window sizes.

Table 14
The execution time for FPI-static, FPI-dynamic, AFPI-static and AFPI-dynamic.

Dataset   FPI-static (s)   FPI-dynamic (s)   AFPI-static (s)   AFPI-dynamic (s)
Data 1    0.004            0.02              <0.0001           0.000258
Data 2    0.01             0.07              <0.0001           0.000887
Data 3    0.07             0.8               0.001489          0.024685

Fig. 22. The precisions and recalls of FPI, AFPI, BSW and ASW for Data 1, in terms of tops 1, 3, 5 and 10.


In fact, dynamic-based and static-based search have individual advantages. Dynamic-based search is a query-adaptive search, but it needs more computation cost since each query has to construct the index-tree on-line and the generated patterns are very numerous. In contrast to dynamic-based search, static-based search lacks an automatically adapted index-tree and window size; hence, if the query length is shorter than the specified window size, the proposed search cannot work. From another viewpoint, new videos will never be found by an un-updated static index-tree. Note that we employ the static-based strategy to carry out the remaining experiments.

4.2.2. Comparisons of our proposed approaches and other approaches
After clarifying the involved parameter settings, the first evaluation we are interested in is the effectiveness of our proposed methods.


Fig. 24. The precisions and recalls of FPI, AFPI, BSW and ASW for Data 3, in terms of tops 1, 3, 5 and 10.

Table 15
The performance of FPI, AFPI, BSW and ASW.

Dataset   FPI (s)   AFPI (s)   BSW (s)   ASW (s)
Data 1    0.004     <0.0001    0.03      0.1
Data 2    0.01      <0.0001    0.08      0.38
Data 3    0.07      0.001489   2.09      7.9

Fig. 23. The precisions and recalls of FPI, AFPI, BSW and ASW for Data 2, in terms of tops 1, 3, 5 and 10.


Figs. 22–24 show the precision and recall results of FPI, AFPI, BSW and ASW for the three data sets. Obviously, these results reveal that our proposed methods outperform the other traditional methods significantly. For Data 1, Fig. 22 shows that the effectiveness of the four approaches is very close. Although BSW can bring out good precision when the recall is smaller than 0.75, the unstable performance of BSW cannot make it the best choice among the four methods for the users.

Fig. 25. The precisions of FPI and AFPI for different categories.

For Data 2, Fig. 23 depicts two aspects: (1) the retrieval quality of news videos for our proposed methods is better than that of the others, and (2) our proposed methods perform more stably than the others. Fig. 24 delivers points similar to Fig. 23; from Fig. 24, we can further realize that our proposed method is very promising for CBVR, especially for large-scale diverse data. Another evaluation we want to show is the performance. Performance is a very important factor for a search system: no one can put up with a long response time, especially for an on-line search system. Table 15 exhibits that, whatever the data is, our proposed methods execute much faster than the other traditional methods. Although FPI has to re-rank the results, the execution time is still much less than one second. From Table 15, the performance of ASW is much worse than that of FPI-based search although the effectiveness of FPI, AFPI and ASW is pretty close.

On the whole, the experimental results further prove that our proposed method can achieve high-quality content-based video retrieval. Let us now summarize the findings of the above experimental results briefly. The choice between FPI and AFPI depends on the dataset. FPI is a visual-sensitive video access method; that is, it can find the videos with very similar visual features. Nevertheless, it needs more execution cost since the results have to be re-ranked by further visual comparisons. On the contrary, AFPI is very promising for any data set, especially for diverse data, for the reason that AFPI can precisely identify the discriminability of a pattern; thus, the videos can be distinguished precisely.

4.2.3. Effectiveness of our proposed approach for different categories
After presenting the involved parameter settings and the comparisons between our proposed approaches and the others, we now demonstrate the effectiveness for different categories. This can help us further realize the differences among the categories and make an attempt to approximate better solutions in the future. As shown in Figs. 25 and 26, the precisions of FPI and AFPI can reach 70%. The interpretation of the results is that, if different categories share the same visual contents, it is not easy to reach a high precision of video retrieval.


The interpretation of these results is that, if different categories share the same visual contents, it is not easy to reach high precision of video retrieval. For example, the visual contents of military, plane and surfing are so similar that the generated patterns are not discriminative enough. For this problem, other contents, such as audio and motion, may be the next considerations in our future work.

Fig. 26. The recalls of FPI and AFPI for different categories.

Fig. 27. Example of the resulting commercial clips returned by AFPI.

Fig. 28. Example of the resulting news clips returned by AFPI.

Still, the overall results reveal that FPI and AFPI are robust in finding the relevant clips for each different category.

At the end of this section, we show the real experimental results through some illustrative examples, as depicted in Figs. 27–34.


Fig. 29. Example of the resulting cartoon clips returned by AFPI.

Fig. 30. Example of the resulting military clips returned by AFPI.

Fig. 31. Example of the resulting plane clips returned by AFPI.

Fig. 32. Example of the resulting racing-car clips returned by AFPI.


Fig. 33. Example of the resulting baseball clips returned by AFPI.

Fig. 34. Example of the resulting sea-world clips returned by AFPI.


On one hand, the returned results depend on the visual relationships of the sequential shots, as mentioned in Section 3.1. On the other hand, Figs. 30–32 depict that the clips of military, plane and racing-car are very similar in visual contents. These phenomena reveal that categories with the same visual temporal continuity are hard to distinguish by visual contents alone.

5. Conclusions and future work

In this paper, we have presented a novel method for content-based video retrieval using pattern-based indexing and matching techniques. The main contribution of the proposed method is that it achieves high-quality video retrieval without requiring any query terms. The pattern-based index effectively deals with the problem of high-dimensional visual features that arises in current visual-based sequence matching methods. The experimental results show that the proposed method can substantially enhance the precision and recall of content-based video retrieval even though only two kinds of visual features are considered. Moreover, AFPI is an efficient method for finding the desired videos from a massive amount of diverse data. In the future, we will further address the following issues. First, in addition to color layout and edge histogram, more types of features, such as motion, audio and other visual features, will be considered. Second, we will apply the AFPI-tree to other types of content-based multimedia retrieval.

Acknowledgment

This research was supported by the National Science Council, Taiwan, ROC under Grant No. NSC 97-2422-H-006-001.

References

Adjeroh, D. A., Lee, M. C., & King, I. (1998). A distance measure for video sequences similarity matching. In Proceedings of the IEEE conference on multimedia computing and systems (pp. 72–79). Austin, TX, USA.

Aoki, H., Shimotsuji, S., & Hori, O. (1996). A shot classification method of selecting effective key-frames for video browsing. In Proceedings of the fourth ACM international conference on multimedia (pp. 1–10). Boston, MA, USA.

Chen, L., & Chua, T. S. (2001). A match and tiling approach to content-based video retrieval. In Proceedings of the IEEE international conference on multimedia and expo (pp. 301–304). Tokyo, Japan.

Cheng, W. G., & Xu, D. (2003). Content-based video retrieval using the shot cluster tree. In Proceedings of the second IEEE international conference on machine learning and cybernetics (pp. 2901–2906). Xi'an, China.

Dimitrova, N., Zhang, H. J., Shahraray, B., Sezan, I., Huang, T., Zakhor, A., et al. (2002). Applications of video-content analysis. IEEE Transactions on Multimedia, 9, 42–55.

Gaughan, G., Smeaton, A. F., Gurrin, C., Lee, H., & Mc Donald, K. (2003). Design, implementation and testing of an interactive video retrieval system. In Proceedings of the fifth ACM SIGMM international workshop on multimedia information retrieval (pp. 23–30). Berkeley, CA, USA.

Jain, A. K., Vailaya, A., & Wei, X. (1999). Query by video clip. ACM Multimedia Systems: Special issue on video libraries (Vol. 7(5), pp. 369–384). Secaucus, NJ, USA.

Kim, Y. T., & Chua, T. S. (2005). Retrieval of news video using video sequence matching. In Proceedings of the 11th international multimedia modelling conference (Vol. 00, pp. 68–75). Washington, DC, USA.

Kim, S. H., & Park, R.-H. (2002). An efficient algorithm for video sequence matching using the modified Hausdorff distance and the directed divergence. IEEE Transactions on Circuits and Systems for Video Technology, 12(7), 592–596.

Liu, X., Zhuang, Y., & Pan, Y. (1999). A new approach to retrieve video by example video clip. In Proceedings of the seventh ACM international conference on multimedia (pp. 41–44). Orlando, FL, USA.

Peng, Y., & Ngo, C. W. (2006). Clip-based similarity measure for query-dependent clip retrieval and video summarization. IEEE Transactions on Circuits and Systems for Video Technology, 16(5), 612–627.

Rautiainen, M., Ojala, T., & Seppänen, T. (2004). Analysing the performance of visual, concept and text features in content-based video retrieval. In Proceedings of the sixth ACM SIGMM international workshop on multimedia information retrieval (pp. 197–204). New York, NY, USA.


Santini, S., & Jain, R. (1999). Similarity measures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9), 871–883.

Shan, M. K., & Lee, S. Y. (1998). Content-based video retrieval based on similarity of frame sequence. In Proceedings of the IEEE conference on multimedia computing and systems (pp. 90–97). Austin, TX, USA.

Tseng, V. S., Lee, C.-J., & Su, J.-H. (2005). Classify by representative or associations (CBROA): A hybrid approach for image classification. In Proceedings of the international workshop on multimedia data mining (KDD/MDM).

Tseng, V. S., Su, J.-H., & Huang, J.-H. (2006). A novel video annotation method by integrating visual features and frequent patterns. In Proceedings of the seventh international workshop on multimedia data mining (KDD/MDM). Philadelphia, PA, USA.

Tseng, V. S., Su, J.-H., Huang, J.-H., & Chen, C.-J. (2008). Integrated mining of visual features, speech features and frequent patterns for semantic video annotation. IEEE Transactions on Multimedia, 10(1).

Virga, P., & Duygulu, P. (2005). Systematic evaluation of machine translation methods for image and video annotation. In Proceedings of the fourth international conference on image and video retrieval (pp. 487–496). Singapore.

Wu, Y., Zhuang, Y., & Pan, Y. (2000). Content-based video similarity model. In Proceedings of the eighth ACM international conference on multimedia (pp. 465–467). Los Angeles, CA, USA.

Zhu, X., Elmagarmid, A. K., Xue, X., Wu, L., & Catlin, A. C. (2005). InsightVideo: Toward hierarchical video content organization for efficient browsing, summarization and retrieval. IEEE Transactions on Multimedia, 7(4), 648–665.