
Video Retrieval − Evolution of Video Segmentation, Indexing and Search

Lekha Chaisorn, Corey Manders, and Susanto Rahardja
Institute for Infocomm Research, A*STAR

1 Fusionopolis Way, Connexis Tower, Singapore 138632
email: {clekha,cmanders,rsusanto}@i2r.a-star.edu.sg

Abstract

This paper discusses the history and current trends of video retrieval, focusing mainly on video segmentation, indexing and search. The objective is to share with readers what has been accomplished so far as well as the current trends in the field. Unlike text documents, video contains dynamic information such as audio and motion (object and/or camera motion), so indexing videos for later search remains a difficult problem. In addition, one particular problem with digital videos is that copyright issues are hard to deal with, and indexing needs to take this into consideration. In this paper, the history and some of the state-of-the-art methods that help to solve these problems are addressed.

1. Introduction

Rapid advances in computing, multimedia, and networking technologies have resulted in the production and distribution of large amounts of multimedia data, in particular digital video. To manage these sources of video effectively, it is necessary to organize them in a way that facilitates user browsing and retrieval. Researchers have put considerable effort into segmenting, indexing, and organizing digital videos in terms of shots; a comprehensive survey of this research is reported in [1]. Digital videos, especially news videos from outlets such as CNN and ABC that are available on the web, are a good source of information. Users normally do not watch a news broadcast from start to end; instead, they often access the news by topic of interest. Some users give priority to finance or business news, while others may be interested in world news such as the “war in Iraq”. Thus, news video broadcasts need to be segmented into appropriate units to support this kind of retrieval.

Another problem with digital videos is that they can be modified and edited quite easily with various tools. The resulting modified videos may differ greatly from the originals in many respects, so detecting modified copies of an original video can be difficult, especially when the copy has been heavily modified (color shift, resolution change, frame rate change, etc.). This paper addresses these problems with some state-of-the-art solutions. Our approaches to shot and story segmentation as well as to the copy detection problem are also discussed.

2. Video Segmentation

Before we can analyze a particular video, the video needs to be segmented into semantic units termed shots. A video sequence, for example a half-hour news video, may be comprised of a few hundred shots. Several shots with cohesive semantics (focusing on one main topic) form a story. What follows is a brief summary of shots, scenes, and stories, together with methods for segmenting a video into shots and stories.

2.1. Shot Segmentation

A shot is a continuous group of frames that a camera takes at a physical location. A semantic scene or story is defined as a collection of shots that are consistent with respect to a certain semantic theme, for example, several shots taken at the beach. Shots may also be joined by gradual transitions, which are frequently used as an editing technique to connect two shots and can be classified into three common types: fade in/out, dissolve, and wipe. A fade-in begins in total darkness and gradually lightens to the full brightness of a scene; a fade-out is exactly the opposite. A dissolve is a gradual change from one scene into another, in which one scene gradually decreases in intensity (fades out) while the other gradually increases (fades in) at the same time and rate. Lastly, a wipe shows the new scene appearing behind a line that moves across the screen.

2.1.1. Existing Techniques. Research on segmenting an input video into shots and using these shots as the basis for video organization is well established, as reported in [1]. Effective techniques for detecting abrupt changes or hard cuts are reported in [2] and [3]; the best accuracy achieved is over 90%. Most learning techniques are unsupervised and use color, texture and/or motion as features. In the CNN and ABC news videos used in TRECVID 2003 and TRECVID 2004, more than 60% of the total shots used in shot detection are hard cuts and more than 20% are gradual transitions.



2.1.2. The Use of Hue and Saturation for Shot Segmentation. Aside from the shot segmentation techniques reported in the TRECVID benchmarking workshop and existing techniques reported elsewhere, we have also developed a method using average hue to identify shot boundaries.

If we consider contiguous sets of video frames, statistics such as the average red value, the average intensity, and the average saturation will, as a generalization, not change greatly except at shot changes. Apart from shots that fade from one to another, these statistics exhibit only abrupt changes. We expect the same of the average pixel hue and saturation if we perform an RGB to HSV transformation of the color space, where the hue (h) of each pixel is calculated from its red, green, and blue values.

As a means of detecting shot changes in video sequences, we have devised a method that uses the average hue and saturation of the scene. In scenes where the overall illumination changes, for example flashes of bright white light or outdoor scenes where the sun moves behind clouds, we expect both hue and saturation to remain quite stable. Consequently, tracking changes in the average hue and saturation proved very effective in detecting shot changes: sharp changes in the average hue of a video sequence tend to coincide with shot changes, as do changes in the average saturation. To help detect shot changes where transitions are more gradual, we maintain a running history of the hue and saturation averages of the last 10 to 15 frames and compare the incoming frame's average hue and saturation with the averages over these past frames. We set a threshold Th for hue and Ts for saturation, such that when the difference between the new frame's average hue or saturation and the past frames' average exceeds its respective threshold, we signal a shot change. More details on how the hue values are obtained are given in what follows. Figure 1 shows an example of a shot boundary identified by our method. As for key frame extraction, many methods have been proposed: some select frames based on motion analysis [5] and some employ uniform sampling of multiple key frames in the video [6]. For simplicity, our method adopts the latter idea by selecting the first, middle and last frame of each shot as the shot key frames.
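As a rough illustration of this procedure, the sketch below keeps a running history of per-frame average hue and saturation and flags a boundary whenever the incoming frame's averages drift from the history by more than a threshold. The history length, the threshold values Th and Ts, the decision to reset the history after a detected cut, and the use of Matplotlib's colour conversion are illustrative assumptions; the paper does not specify them.

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv

def detect_shot_changes(frames, history=12, th_hue=0.08, th_sat=0.10):
    """Flag shot boundaries from average hue/saturation (sketch of Sec. 2.1.2).

    `frames` is an iterable of RGB frames (uint8 arrays of shape HxWx3).
    Thresholds and history length are illustrative; hue is averaged as a
    plain scalar here, ignoring its circular nature.
    """
    boundaries, hue_hist, sat_hist = [], [], []
    for idx, frame in enumerate(frames):
        hsv = rgb_to_hsv(frame.astype(np.float32) / 255.0)
        h, s = hsv[..., 0].mean(), hsv[..., 1].mean()   # average pixel hue / saturation
        if hue_hist and (abs(h - np.mean(hue_hist)) > th_hue or
                         abs(s - np.mean(sat_hist)) > th_sat):
            boundaries.append(idx)            # signal a shot change at this frame
            hue_hist, sat_hist = [], []       # restart the running history
        hue_hist.append(h)
        sat_hist.append(s)
        hue_hist, sat_hist = hue_hist[-history:], sat_hist[-history:]
    return boundaries
```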

2.2. Story Segmentation

We usually remember video content, especially news video, in terms of events or stories rather than in terms of changes in visual appearance as in shots. It is thus necessary to organize video content in terms of small, single-story units that represent the conceptual chunks in users' memories. Moreover, the stories can be summarized at different scales to support queries such as “give me a summary of the H1N1 flu in yesterday's news”. The story units therefore serve as the basic units for news video organization; these story units, with their classified shots, can then be stored in a database to support news retrieval tasks. The problem of segmenting news video into story units is challenging, especially when there is no supplementary text transcript. Story segmentation based on a text transcript is easier and less expensive than segmentation performed on the news video using audio-visual features. There are several techniques for performing text segmentation on news transcripts. Most are statistically based and designed to find coherent bodies of text terms that represent a story or topic; a story boundary therefore occurs at a position where there is the least coherence or similarity between adjacent text units. Based on this principle, one successful technique is the tiling technique introduced by Hearst in 1994, as referred to in [1]. However, the maximum accuracy reported for story segmentation based on the news transcripts of CNN and ABC news used in the TRECVID 2003 evaluations [2] was only about 62%. The reason for this low level of performance is that text statistics alone are insufficient to capture the rich set of semantic clues and presentation features used to signify the end of stories in news videos. Thus, there was a need to look into the audio-visual features of news videos to assist in story segmentation; the work in [1] demonstrated the effectiveness of such audio-visual features.
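The coherence-minimum idea for transcripts can be illustrated with a minimal, TextTiling-style sketch: compare the word statistics of the sentence blocks on either side of each candidate gap and place story boundaries at local similarity minima. The block size and threshold below are illustrative, and this covers only the text-based part; the approach evaluated in [1] additionally relies on audio-visual features.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    num = sum(a[w] * b[w] for w in a if w in b)
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def story_boundaries(sentences, block=5, threshold=0.1):
    """Mark gaps where adjacent blocks of transcript sentences are least coherent."""
    scores = []
    for gap in range(block, len(sentences) - block):
        left = Counter(w for s in sentences[gap - block:gap] for w in s.lower().split())
        right = Counter(w for s in sentences[gap:gap + block] for w in s.lower().split())
        scores.append((gap, cosine(left, right)))
    # A boundary is a local coherence minimum that also falls below the threshold.
    return [g for i, (g, c) in enumerate(scores)
            if c < threshold
            and (i == 0 or c <= scores[i - 1][1])
            and (i == len(scores) - 1 or c <= scores[i + 1][1])]
```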

3. Video Indexing and Search

Video search may be viewed from two different aspects: (1) searching for similar videos, or (2) searching for videos whose content may have been edited. The first serves general search enquiries; the latter relates to copyright issues. Both are important to users with tasks related to either of these. The sections below briefly discuss some of the approaches to these two aspects.

3.1. Content-based Approaches

Content-based approaches mainly support the first purpose of video search. TRECVID has defined this as one of its evaluation tasks since 2001 [7]. To date, as reported in [8], the accuracy achieved is still far from satisfactory. The problem in video search is related to video indexing, that is, choosing the essential features that best represent each video segment (which may be a shot or a story). These features include low-level features (such as color and texture), temporal features such as audio and motion (object or camera motion), and high-level features (such as faces, objects, and events). Current research in video search has concentrated on extracting high-level features, which is also identified as one of the tasks in the TRECVID benchmarking workshop.



Figure 1. A shot change in a music video detected by hue and saturation tracking. The hue and saturation for four consecutive frames (a, b, c, and d) from the video are shown. At the point where the shot change occurs, between frame (b) and frame (c), the hue changes from 45.6 (with an average of 45.6 over the last 10 frames) to 34.3. Furthermore, the saturation changes from 12.5 (with an average of 35.2 over the last 10 frames) to 169.9. Thus, the shot change is easily detected. Video courtesy of [4].

These features (low-level, temporal, and high-level) together form a comprehensive index that facilitates searching the video. One of the prominent works in this area is reported in [9] and focuses on building concepts. The latest work on video search demonstrated to be effective and robust is the work done at NUS [10], which won the 2008 Singapore Star Challenge for video search. The winning system improved on the interactive search system published in TRECVID 2007 [11]. In the algorithm used in the Star Challenge search, the developers employed their previous query-dependent retrieval method, which automatically discovers the query class and query high-level features (query-HLF). This is then used to fuse the available multimodal features with other relevant feedback to attempt to solve the interactive search problem.

3.2. Signature-based Approaches

The problem in video search arises when searching for videos with the same content as the query video but under different conditions (color, lighting, etc.) due to differences in capturing or post-processing. Recent research has addressed this problem and several techniques have been proposed to aid in this task. Most work introduces a framework that creates a signature for a given input video, with the aim of a system that automatically generates the signature such that even severe transformations of the original video cause no significant change in the signature. Ideally, the algorithm should be able to identify copies of the video. Many applications can make use of this technology: video owners may use the signatures of their videos to detect instances of copyright infringement, while advertisers may use them to monitor the appearances of their commercials to ensure they are being broadcast as agreed. The signature can also be used to track and link similar news stories, since news programmes often reuse video clips when reporting updates on an event. However, no method of generating a video signature yet stands out as superior to all others; current methods are promising but not yet ready for practical use, demanding further research in this area.

3.2.1. Existing Signature-based Methods. A comprehensive survey of the early work in the area of video retrieval may be found in [12], and a study of several state-of-the-art methods for video copy detection can be found in [13]. Chen and Stentiford [14] use the rank of regions along the time axis to define a temporal ordinal measure. This method is efficient and robust to many common transformations; its limitation, however, is that it can only detect video sequences of the same length as the query video, and not segments of the query. Other methods compared in the study are [15] and [16], both of which use signatures to describe the local regions surrounding feature points. Both are robust against a variety of transformations, but the detection of feature points does not perform well under several conditions, such as low contrast or excessive noise. Work showing the importance of invariance to monotonic non-linear operations is detailed in [17]. The ordinal method our process builds upon was first proposed by Mohan in [18] and later used by Hua et al. in [19]; it is a simple and computationally inexpensive technique but seems to lack robustness to certain transformations. Naphade [20] suggested a colour-based signature that is fast and simple; however, Hampapur and Bolle [21] found that it was not robust to global colour variations caused by different encoding methods and that it also produced false positives due to similar colour schemes in different videos.

3.2.2. Our Ordinal-based Method for Video Signature. Our proposed system consists of two sub-systems: video signature generation and video matching (detection of modified copies). In the first sub-system, each input video is segmented into shots using the hue-based segmentation method we have developed. Each shot is then processed to select its reference frames or key frames.


We simply select the first, middle and last frames of each shot as its key frames, so each shot has three representative key frames. Next, a unique signature is created for each of these frames; these frame signatures in temporal order form the signature of the video. The second sub-system is video matching. We employ a commonly used matching technique based on Euclidean distance and, in addition, introduce a face feature as a primary filter to reduce the false alarm rate observed in the results of our previous work.

Since a video is segmented into shots and each shot is represented by its reference frames, before we can generate a signature for a video we first need to generate image signatures for its frames. For this process, we build on the previously published work of [12], which in turn improved on [22] for video signature and sequence matching. If the image is a greyscale image to begin with, the greyscale pixel values are used. If the image is in color, the luminance of the color image is computed and then used (which is convenient given that our shot detection process already performs an RGB to HSV color transformation). Individual pixels of an image I may be addressed as I(x, y), with x ∈ [0, . . . , width − 1] and y ∈ [0, . . . , height − 1]. We sub-divide this greyscale image into blocks of size p × q, {B1, . . . , Bm}, where the sub-division yields m such blocks. Let Bk(i, j) denote the pixel at position (i, j) of block Bk, such that Bk(0, 0) is the top-left corner, i ∈ [0, . . . , p − 1] and j ∈ [0, . . . , q − 1]. When comparing two images, each block in one image is compared to the corresponding block in the other image, based on the rank of the pixels in each block.

Let us now consider a single block Bk of the greyscale image I. For each pixel in Bk, we define a slice Sk(i,j) of dimension identical to I (not to the block), via the binary operator Sk(i,j)(x, y) = 1 if Bk(i, j) < I(x, y), and 0 otherwise. Next we define a metaslice Mk, also of dimension identical to I, as the component-wise addition of the slices: Mk(x, y) = Σi,j Sk(i,j)(x, y), where i = 0, . . . , p − 1 and j = 0, . . . , q − 1.

From the metaslices, we express the signature of an image as S(I) = {M1, . . . , Mm}.
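A minimal NumPy sketch of this slice/metaslice construction is given below. The strict inequality follows the text; counting with a sorted block and searchsorted is simply an efficient way of summing the binary slices, and the assumption that the image dimensions are multiples of p and q is ours.

```python
import numpy as np

def image_signature(gray, p=8, q=8):
    """Ordinal slice/metaslice signature S(I) = {M_1, ..., M_m} of a greyscale image.

    Assumes the image height and width are multiples of p and q; the block
    size (p, q) is an illustrative choice.
    """
    gray = np.asarray(gray, dtype=np.float32)
    h, w = gray.shape
    metaslices = []
    for top in range(0, h, p):
        for left in range(0, w, q):
            block = np.sort(gray[top:top + p, left:left + q].ravel())
            # M_k(x, y) = number of block pixels B_k(i, j) strictly less than
            # I(x, y), i.e. the component-wise sum of the binary slices S_k(i,j).
            Mk = np.searchsorted(block, gray.ravel(), side='left')
            metaslices.append(Mk.reshape(h, w))
    return metaslices
```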

The next step in our process is to create a video signature for each input video sequence. In our method, each video sequence is represented by its reference frames; after we obtain the signatures of the reference frames of a video sequence, these signatures together form the signature of the video.
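Putting the pieces together, a hedged end-to-end sketch of the signature-generation sub-system could look as follows. It reuses the detect_shot_changes and image_signature sketches above, approximates luminance by the mean of the RGB channels (an assumption; the paper computes the actual luminance), and stores one image signature per key frame in temporal order.

```python
import numpy as np

def video_signature(frames, p=8, q=8):
    """Video signature = ordered list of key-frame image signatures (sketch)."""
    cuts = [0] + detect_shot_changes(frames) + [len(frames)]
    signature = []
    for start, end in zip(cuts[:-1], cuts[1:]):
        if end <= start:
            continue
        # The first, middle and last frames of the shot serve as key frames.
        for idx in sorted({start, (start + end - 1) // 2, end - 1}):
            gray = frames[idx].astype(np.float32).mean(axis=2)  # crude luminance
            signature.append(image_signature(gray, p, q))
    return signature
```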

3.2.3. Similarity Matching Process. Since a video is represented by its reference frames, comparing two videos requires computing the differences between their reference frames on a frame-to-frame basis. For each pair of reference frames we construct the corresponding frame signatures S(I1) and S(I2), and then compute the Frobenius norms of the differences between the corresponding metaslices of the two signatures. For example, if S(I1) = {M1, . . . , Mm} and S(I2) = {N1, . . . , Nm}, we calculate the values Di = ‖Mi − Ni‖F, yielding m values {D1, . . . , Dm}. Note that, as opposed to [22] where the squared norm is computed, we use the (unsquared) Frobenius norm to aid in the subsequent normalization described below. By summing the values {D1, . . . , Dm}, we obtain a single scalar distance measure μ.

Because the resulting value μ can vary in magnitude with the image size and the chosen block size, μ is normalized over these parameters to give a value λ. When comparing two images, a value of λ closer to 0 indicates similar frames, while a value closer to 1 indicates dissimilar frames.
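The distance computation just described can be sketched as follows. The paper states that μ is normalized over the image and block sizes but does not spell out the formula, so dividing by the largest value μ could attain (every metaslice entry differing by the maximum p·q) is an assumption made here to map λ into [0, 1].

```python
import numpy as np

def frame_distance(sig1, sig2, pq):
    """Normalized distance lambda between two frame signatures (sketch).

    sig1, sig2: lists of metaslices as returned by image_signature();
    pq = p * q is the number of pixels per block. Normalizing by the
    maximum attainable mu is an assumption.
    """
    D = [np.linalg.norm(M - N, ord='fro') for M, N in zip(sig1, sig2)]  # D_i
    mu = float(np.sum(D))
    m = len(sig1)
    h, w = sig1[0].shape
    max_mu = m * pq * np.sqrt(h * w)   # every entry differing by the maximum p*q
    return mu / max_mu                 # lambda: near 0 = similar, near 1 = dissimilar
```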

3.2.4. Detection of Modified Copies. We further extend the image matching algorithm to video matching. Given a query video Q, we extract its reference frames {RF1(Q), . . . , RFr(Q)} and from these acquire a video signature for Q using the image signatures of the reference frames: S(Q) = {S(RF1(Q)), . . . , S(RFr(Q))}. Similarly, for a video sequence V in the database, the signature generated from its reference frames is S(V) = {S(RF1(V)), . . . , S(RFs(V))}. Given the video signatures S(Q) and S(V), we construct a table of frame differences FD(u,v) and a corresponding set of normalized differences λ(u,v), where u ∈ [1, . . . , r] and v ∈ [1, . . . , s]. From the set of λ(u,v), we create a similarity matrix using a threshold ε,

η(u,v) = 1, if λ(u,v) < ε; 0, otherwise.    (1)

The longest chain(s) of ones along any diagonal that runs from top left to bottom right is found, and the key frames corresponding to the chain(s) are returned as the sequence present in both videos. Examples of videos retrieved by our system are presented in Figure 2.
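A hedged sketch of this matching step is shown below: it fills the similarity matrix η(u, v) of Eq. (1) using the frame_distance sketch above and then scans every top-left to bottom-right diagonal for the longest run of ones. The threshold ε is illustrative, and both videos are assumed to have the same frame resolution.

```python
import numpy as np

def match_videos(sig_Q, sig_V, pq, eps=0.2):
    """Return (start in Q, start in V, length) of the longest shared key-frame run."""
    r, s = len(sig_Q), len(sig_V)
    # eta(u, v) = 1 if the normalized frame distance lambda(u, v) < eps, else 0.
    eta = np.zeros((r, s), dtype=int)
    for u in range(r):
        for v in range(s):
            eta[u, v] = 1 if frame_distance(sig_Q[u], sig_V[v], pq) < eps else 0
    best_len, best_span = 0, None
    for k in range(-(r - 1), s):                 # every top-left to bottom-right diagonal
        diag = np.diagonal(eta, offset=k)
        run = 0
        for i, val in enumerate(diag):
            run = run + 1 if val else 0
            if run > best_len:
                best_len = run
                u_end = i if k >= 0 else i - k   # map diagonal index back to (u, v)
                v_end = i + k if k >= 0 else i
                best_span = (u_end - run + 1, v_end - run + 1, run)
    return best_span
```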

4. Conclusion

We have presented the history of and current methods for video search and retrieval. The discussion covered video segmentation, indexing and search, as well as a video signature method for copy detection and general matching, an area in which conventional video search methods encounter several problems. Video information is unstructured and dynamic, so building a robust and effective video search system remains an unresolved problem. Trends in video search continue to focus on extracting semantics and building concepts from high-level features of individual videos, such as objects, scenes, and events.



Figure 2. Examples of returned results from the system in the case where the frame rate has been modified. Note that (a) and (b) are key frames from the original queries, and (c) and (d) are examples of returned results.

The TRECVID benchmarking workshop is one particular forum that raises awareness in video search by identifying it as one of the evaluation tasks. The general expectation for the near future is a video search system comparable to the state-of-the-art text search engines, such as those from Google and Yahoo.

References

[1] Lekha Chaisorn, A Hierarchical Multimodal Approach to Story Segmentation in News Video, Ph.D. thesis, School of Computing, National University of Singapore, 2004.

[2] TRECVID 2003, “Guidelines for the TRECVID 2003 Evaluation,” http://www-nlpir.nist.gov/projects/tv2003/tv2003.html.

[3] TRECVID 2004, “Guidelines for the TRECVID 2004 Evaluation,” http://www-nlpir.nist.gov/projects/tv2004/tv2004.html.

[4] Various Artists, “Mv3gp.com,” 2009, http://www.mv3gp.com/3gp videos artists.php.

[5] W. Wolf, “Key frame selection by motion analysis.”

[6] Simone Santini, “Who needs video summarization anyway?,” in Proceedings of the International Conference on Semantic Computing, 2007.

[7] TRECVID 2001, “Guidelines for the TREC-2001 video track,” http://www-nlpir.nist.gov/projects/trecvid/revised.html.

[8] TRECVID 2008, “Guidelines for the TRECVID 2008 Evaluation,” http://www-nlpir.nist.gov/projects/tv2008/tv2008.html.

[9] Shih-Fu Chang, Junfeng He, Yu-Gang Jiang, Elie El Khoury, Chong-Wah Ngo, Akira Yanagawa, and Eric Zavesky, “Columbia University/VIREO-CityU/IRIT TRECVID2008 high-level feature extraction and interactive video search,” in Proceedings of TRECVID 2008 Workshop, 2008.

[10] A*STAR, “Singapore's Star Challenge,” in Press Release for Grand Finals of Star Challenge, October 2008.

[11] T. Chua, S. Neo, Y. Zheng, H. Goh, X. Zhang, S. Tang, Y. Zhang, J. Li, J. Cao, H. Luan, Q. He, and X. Zhang, “TRECVID 2007 search tasks by NUS-ICT.”

[12] Daniel Chen, Lekha Chaisorn, and Susanto Rahardja, “Video signature for copy detection employing an ordinal-based method,” in Proceedings of SPIE: Optics and Photonics 2008, San Diego, USA, August 11-14, 2008, pp. 217–220.

[13] J. Law-To, L. Chen, A. Joly, I. Laptev, O. Buisson, V. Gouet-Brunet, N. Boujemaa, and F. Stentiford, “Video copy detection: a comparative study,” Proc. Intl. Conference on Image and Video Retrieval, Amsterdam, The Netherlands, 2007.

[14] L. Chen and F. Stentiford, “Video sequence matching based on temporal ordinal measurement,” Tech. Rep., 2006.

[15] A. Joly, O. Buisson, and C. Frelicot, “Content-based copy detection using distortion-based probabilistic similarity search,” IEEE Transactions on Multimedia, vol. 9, no. 2, pp. 293–306, 2007.

[16] J. Law-To, O. Buisson, V. Gouet-Brunet, and N. Boujemaa, “Robust voting algorithm based on labels of behavior for video copy detection,” Proc. ACM Multimedia, Santa Barbara, USA, pp. 835–844, 2006.

[17] Steve Mann, Intelligent Image Processing, John Wiley and Sons, November 2, 2001, ISBN 0-471-40637-6.

[18] R. Mohan, “Video sequence matching,” Proc. IEEE Intl. Conference on Acoustics, Speech and Signal Processing, Seattle, USA, vol. 6, pp. 3697–3700, 1998.

[19] X.-S. Hua, X. Chen, and H.-J. Zhang, “Robust video signature based on ordinal measure,” Proc. Intl. Conference on Image Processing, Singapore, vol. 1, pp. 685–688, 2004.

[20] M. Naphade, M. Yeung, and B. L. Yeo, “A novel scheme for fast and efficient video sequence matching using compact signatures,” Proc. SPIE Storage and Retrieval for Media Databases, San Jose, USA, vol. 3972, pp. 564–572, 2000.

[21] A. Hampapur, K. Hyun, and R. M. Bolle, “Comparison of sequence matching techniques for video copy detection,” Proc. SPIE Storage and Retrieval for Media Databases, vol. 4676, pp. 194–201, 2002.

[22] B. Cramariuc, I. Shmulevich, M. Gabbouj, and A. Makela, “A new image similarity measure based on ordinal correlation,” Proc. Intl. Conference on Image Processing, Vancouver, Canada, pp. 718–721, 2000.