
    Digital Object Identifier (DOI) 10.1007/s00530-003-0086-3

Multimedia Systems 9: 157–168 (2003) © Springer-Verlag 2003

    Video summarization and retrieval using singular value decomposition

    Yihong Gong, Xin Liu

NEC Laboratories of America, 10080 North Wolfe Road, SW3-350, Cupertino, CA 95014, USA
e-mail: {ygong,xliu}@ccrl.sj.nec.com

Abstract. In this paper, we propose novel video summarization and retrieval systems based on unique properties of singular value decomposition (SVD). Through mathematical analysis, we derive the SVD properties that capture both the temporal and spatial characteristics of the input video in the singular vector space. Using these SVD properties, we are able to summarize a video by outputting a motion video summary with a user-specified length. The motion video summary aims to eliminate visual redundancies while assigning equal show time to equal amounts of visual content in the original video program. On the other hand, the same SVD properties can also be used to categorize and retrieve video shots based on their temporal and spatial characteristics. As an extended application of the derived SVD properties, we propose a system that is able to retrieve video shots according to their degrees of visual change, color distribution uniformity, and visual similarity.

Key words: Video summarization – Video retrieval – Singular value decomposition – Color histograms

    1 Introduction

The widespread distribution of video images in computer systems and networks has presented both excitement and challenges. Video is exciting because it conveys real-world scenes most vividly and faithfully. Handling video is challenging because video images are voluminous and redundant, and their overall contents cannot be captured at a glance. With a large video data collection, it is always a painful task to find either the appropriate video sequence or the desired portions of the video. The situation becomes even worse on the Internet. To date, more and more web sites provide video images for news broadcasting, entertainment, or product promotion. However, with the very limited network bandwidth available to most home users, people spend minutes or tens of minutes downloading voluminous video images, only to find them irrelevant. To turn video collections into valuable information resources, techniques that enable effective content-based search and facilitate quick screening of the search results become indispensable.

With a large amount of video data, presenting the user with a summary of each video program greatly facilitates the task of finding the desired video content. Video content search and summarization are the two essential technologies that complement each other. Video search engines return a set of video data meeting certain criteria, and video summarizers produce video content summaries that enable users to quickly grasp the overall content of each returned video. On the Internet, concise and informative video summaries are particularly important to accommodate limited communication bandwidths. Video search engines serve as an information filter that sifts out an initial set of relevant videos from the database, while video summarizers serve as an information spotter that helps the user quickly examine a given video set and spot the final set of desired video images.

To date, video summarization is mainly achieved by extracting a set of keyframes from the original video and displaying thumbnails of the keyframes in a storyboard window. The disadvantage of this approach is that keyframes are just a set of static images that contain no spatio-temporal properties of the original video. A video program is a continuous recording of real-world scenes. What distinguishes the video medium from the image medium is video's capability of depicting the dynamics and the spatio-temporal evolution of the target scene. A set of static keyframes by no means captures these essential video properties, and is indeed a poor representation of the general visual content of a video program.

In this paper, we propose novel video summarization and retrieval systems based on unique properties of singular value decomposition (SVD). Through mathematical analysis, we derive the SVD properties that capture both the temporal and spatial characteristics of the input video in the singular vector space. Using these SVD properties, we are able to summarize a video by outputting a motion video summary with a user-specified length. The motion video summary aims to eliminate visual redundancies while assigning equal show time to equal amounts of visual content in the original video program. The automatically generated motion video summaries are not intended to replace the original videos, but to facilitate better visual content overviews by which the user is able to quickly figure out the general contents of the video collection and to judge whether those contents are of interest. On the other hand, the same SVD properties can also be used to categorize and retrieve video shots based on their temporal and spatial characteristics. As an extended application of the derived SVD properties, we propose a system that is able to retrieve video shots according to their degrees of visual change, color distribution uniformity, and visual similarity.

In the following, Section 2 describes related work in the literature. Section 3 outlines the proposed video summarization and retrieval systems. Sections 4 to 7 describe the major components of the proposed systems. Sections 8 and 9 present the performance evaluations of the video summarization and retrieval systems, respectively. Section 10 summarizes the paper.

    2 Related work

To date, video summarization is mainly achieved by using keyframes extracted from original video sequences. Many works focus on breaking video into shots and then finding a fixed number of keyframes for each detected shot. Tonomura et al. [2] used the first frame of each shot as a keyframe. Ueda et al. [3] represented each shot using its first and last frames. Ferman and Tekalp [4] clustered the frames in each shot and selected the frame closest to the center of the largest cluster as the keyframe.

An obvious disadvantage of the above equal-number keyframe assignment is that long shots, in which camera pan and zoom as well as object motion progressively unveil the entire event, will not be adequately represented. To address this problem, DeMenthon et al. [5] proposed to assign a variable number of keyframes according to the activity level of the corresponding scene shot. Their method represents a video sequence as a trajectory curve in a high-dimensional feature space, and uses a recursive binary curve-splitting algorithm to find a set of perceptually significant points that approximate the video curve. The approximation is repeated until the approximation error falls below a user-specified value. Frames corresponding to these perceptually significant points are then used as keyframes to summarize the video contents. As the curve-splitting algorithm assigns more points to larger curvatures, this method naturally assigns more keyframes to shots with more variations.

Keyframes extracted from a video sequence may contain duplicates and redundancies. In a TV talk show with two talking persons, the video camera usually switches back and forth between the two persons, with the insertion of some global views of the scene. Applying the above keyframe selection methods to this kind of video sequence will yield many keyframes that are almost identical. To remove redundancies from keyframes, Yeung et al. [6] selected one keyframe from each shot, performed hierarchical clustering on these keyframes based on their visual similarity and temporal distance, and then retained only one keyframe for each cluster. Girgensohn and Boreczky [7] also applied the hierarchical clustering technique, grouping the keyframes into as many clusters as specified by the user. For each cluster, a keyframe is selected such that the constraints of an even distribution of keyframes over the length of the video and a minimum distance between keyframes are met. In recent clustering-based video summarization work, Farin et al. [8] proposed to incorporate domain knowledge to eliminate the selection of keyframes from irrelevant or uninteresting shots. Examples of such shots include transitional shots from fades and wipes, shots of commercial advertisements, weather forecast charts, etc. The domain knowledge is incorporated into the clustering process by introducing the feature vectors of uninteresting scenes as additional cluster centers; frames grabbed by the clusters of these additional centers are excluded from the keyframe selection process.

Apart from the above methods for keyframe selection, summarizing video content using keyframes has its own limitations. A video program is a continuous spatio-temporal recording of real-world events. What distinguishes the video medium from the image medium is video's capability of depicting the dynamics and the spatio-temporal evolution of the target scene. A set of static keyframes by no means captures these essential video properties, and is indeed a poor representation of the general visual content of a video program. To reduce the spatio-temporal content loss caused by the keyframe presentation, image mosaicing techniques have been utilized to create a panoramic view of the entire scene recorded by each shot [9–11]. For a given video shot, its mosaic is created by warping all the frames into a common coordinate system and then stitching them together to reconstruct the full background seen by each of the frames in the whole shot. However, a mosaic can be successfully created only when the target scene satisfies several severe conditions: the scene must have no prominent 3D structures, should contain no moving objects, etc. These strict prerequisites exclude the majority of videos from the application range of mosaic-based video summarization methods.

There have been research efforts that strive to output motion video summaries to provide better content overviews. The CueVideo system from IBM provides a fast video playback function that plays long, static shots at a faster speed (higher frame rate) and short, dynamic shots at a slower speed (lower frame rate) [12]. However, this variable-frame-rate playback causes static shots to look more dynamic and dynamic shots to look more static; it therefore dramatically distorts the temporal characteristics of the video sequence. On the other hand, the Informedia system from CMU provides video skims, which strive to identify and play back only semantically important image segments, along with semantically important audio keywords/phrases, in the video sequence [13]. The importance of each image segment is measured using a set of heuristic rules that is highly subjective and content-specific. This rule-based summarization approach places limitations on the handling of diversified video images.

In recent years, content summarization of TV broadcast sports games has received increased attention from researchers in the multimedia community. As sports videos have well-defined internal structures in which all the plays are rule-based and predictable, sports video summarization systems can achieve higher levels of content abstraction by exploiting domain-specific features and knowledge. Gong et al. [14] developed a soccer game parsing system that classifies each shot of a soccer video according to its physical location in the field, or the presence/absence of the soccer ball. The soccer field line patterns, players' positions and movements, and the ball's presence/absence were used to recognize the category of each scene shot. Xu et al. [15] suggested that any soccer game is composed of play and break states, and developed a system that partitions soccer videos into play/break segments by looking at the grass-area (green color) ratio in video frames. Zhong and Chang [16] focused on the fact that all the highlights in baseball and tennis games start from pitches and serves, respectively, and strove to detect baseball pitching views and tennis serving views by classifying color histograms of keyframes of the scene shots. Rui et al. [17] assumed that exciting segments in baseball games are highly correlated with the announcer's excited speech, and mostly occur right after a baseball batting. Based on these assumptions, they detected baseball highlights through analysis of the announcer's speech pitch and detection of the baseball batting sound.

    3 System overview

In this paper, we propose a novel technique for video summarization and retrieval based on the same framework: color histograms together with singular value decomposition. For video summarization, we strive to create a motion video summary of the original video that: (1) has a user-specified summary length and granularity; (2) contains little visual redundancy; and (3) gives equal attention to equal amounts of visual content. The first goal aims to meet the different content overview requirements of a variety of users. The second and third goals are intended to turn the difficult, subjective visual content summarization problem into a feasible and objective one. By common sense, an ideal video summary would be one that retains only semantically important segments of the given video program. However, finding semantically important video segments requires an overall understanding of the video content, which is beyond our reach given the state of the art of current video analysis and image understanding techniques. The task is also very subjective, and it is hard to find commonly agreed-upon criteria for measuring semantic importance. On the other hand, it is relatively easy to measure the activity level of visual content, and to identify duplicates and redundancies in a video sequence. For the purpose of visual content browsing and overview, the video watching time will be largely shortened, and the visual content of the original video will not be dramatically lost, if we eliminate those duplicates/redundancies and preserve the visually active contents. Therefore, instead of summarizing videos by heuristically selecting important video segments, we choose to create video summaries that meet the above summarization goals using the derived SVD properties. The automatically generated video summaries are not intended to replace the original videos, but to facilitate better visual content overviews, with which the user is able to quickly figure out the general contents of the given videos and to judge whether they are of interest.

As for video content retrieval, in addition to the basic requirement of retrieving shots from the database that are visually similar to a sample shot, we also strive to realize the retrieval of shots according to their degrees of visual change and color distribution uniformity. As demonstrated in Sect. 9, these new video retrieval capabilities are valuable for spotting video shots that either contain visually active contents, or consist of frames with skewed color distributions, such as black frames, white frames, and frames with very few colors. Such shots can either indicate semantically important contents (e.g., quickly evolving events, or important messages displayed on a uniformly colored background) or represent visually trivial contents arising from flashlights and transitional periods such as fades and wipes.

Fig. 1. Block diagram of the video summarization and retrieval systems. The input video sequence undergoes video sampling, feature-frame matrix creation, and singular value decomposition; the video summarization subsystem then performs frame clustering based on the visual content metric and summary generation, while the video retrieval subsystem performs temporal/spatial characteristics measurement, shot boundary detection and keyframe extraction, and database insertion.

Figure 1 shows a block diagram of the proposed video summarization and retrieval systems. To reduce the number of frames to be processed by the SVD, we coarsely sample the input video sequence at a fixed rate of five frames/second. Our experiments have shown that this sampling rate is sufficient for video programs without many dramatic motions, such as news, documentaries, talk shows, etc. For each frame i in the sampling set, we create an m-dimensional feature vector $A_i$. Using $A_i$ as column i, we obtain the feature-frame matrix $A = [A_1\ A_2\ \cdots\ A_n]$. Performing the SVD on matrix A projects each frame i from the m-dimensional raw feature space into a κ-dimensional singular vector space (usually κ ≪ m). Through mathematical analysis, we derive unique SVD properties that capture both the spatial and temporal characteristics of the input video in the singular vector space. By using these SVD properties, we are able to achieve the goals set for both video summarization and retrieval at the beginning of this section.

As shown in Fig. 1, the video summarization and retrieval systems share the operations of video sampling, feature-frame matrix creation, and singular value decomposition. The operations of the two systems diverge after the video frames are projected into the singular vector space. However, both systems make use of the same set of singular vectors, and no further feature extraction is involved in their subsequent operations. The details of the major operations in Fig. 1 are described in the subsequent sections.

    4 Feature-frame matrix creation

The video summarization and retrieval systems start from feature-frame matrix creation. From a wide variety of image features, we selected color histograms to represent each video frame. As demonstrated in Sect. 7, the combination of color histograms and SVD captures the color distribution uniformity of each frame. Furthermore, histograms are very good at detecting overall differences between images [18], and are cost-effective to compute. Using cost-effective histograms ensures the feasibility and processing speed of the system in handling long video sequences. In our implementation, we create three-dimensional histograms in the RGB color space with five bins each for R, G, and B, giving a total of 125 bins. To incorporate spatial information about the color distribution, we divide each frame into 3 × 3 blocks and create a 3D histogram for each block. These nine histograms are then concatenated to form a 1125-dimensional feature vector for the frame. Using the feature vector of frame i as the ith column, we create the feature-frame matrix A for the video sequence. Since a small image block does not normally contain all kinds of colors, matrix A is usually sparse. Therefore, SVD algorithms for sparse matrices, which are much faster and more memory-efficient than regular SVD algorithms, can be applied.
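For concreteness, here is a minimal numpy sketch of this feature extraction, assuming frames have already been decoded and sampled into H × W × 3 RGB arrays (the helper names are ours, not the paper's):

```python
import numpy as np

def frame_feature(frame: np.ndarray) -> np.ndarray:
    """Concatenated block histograms for one RGB frame (Sect. 4):
    3 x 3 blocks, one 5 x 5 x 5 RGB histogram per block,
    i.e. 9 * 125 = 1125 dimensions."""
    h, w, _ = frame.shape
    feats = []
    for by in range(3):
        for bx in range(3):
            block = frame[by * h // 3:(by + 1) * h // 3,
                          bx * w // 3:(bx + 1) * w // 3]
            hist, _ = np.histogramdd(block.reshape(-1, 3).astype(float),
                                     bins=(5, 5, 5),
                                     range=((0, 256),) * 3)
            feats.append(hist.ravel())
    return np.concatenate(feats)

def feature_frame_matrix(frames) -> np.ndarray:
    """Matrix A: column i is the feature vector of sampled frame i.
    (A scipy.sparse matrix could hold A instead, since A is sparse.)"""
    return np.stack([frame_feature(f) for f in frames], axis=1)
```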

    5 Singular value decomposition (SVD)

Given an m × n matrix A, where m ≥ n, the SVD of A is defined as [19]:

$A = U \Sigma V^T$  (1)

where $U = [u_{ij}]$ is an m × n column-orthonormal matrix whose columns are called left singular vectors; $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$ is an n × n diagonal matrix whose diagonal elements are non-negative singular values sorted in descending order; and $V = [v_{ij}]$ is an n × n orthonormal matrix whose columns are called right singular vectors. If rank(A) = r, then $\Sigma$ satisfies

$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_n = 0$  (2)
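As a quick sanity check of Eq. (1) and the ordering in Eq. (2), one can run a thin SVD in numpy (so that U is m × n, matching the definition above):

```python
import numpy as np

A = np.random.rand(1125, 300)                 # feature-frame matrix, m >= n
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD

assert U.shape == (1125, 300) and Vt.shape == (300, 300)
assert np.all(np.diff(s) <= 0) and np.all(s >= 0)  # Eq. (2): descending, non-negative
assert np.allclose(A, U @ np.diag(s) @ Vt)         # Eq. (1): A = U Sigma V^T
```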

In our video summarization and retrieval systems, applying the SVD to the feature-frame matrix A can be interpreted as follows. The SVD derives a mapping between the m-dimensional raw feature space spanned by the color histograms and the r-dimensional singular vector space, all of whose axes are linearly independent. This mapping projects each column vector $A_i = [a_{1i}\ a_{2i}\ \cdots\ a_{mi}]^T$ of matrix A, which represents the concatenated histograms of frame i, to the column vector $\psi_i = [v_{i1}\ v_{i2}\ \cdots\ v_{ir}]^T$ of matrix $V^T$, and projects each row vector j of matrix A, which gives the occurrence count of concatenated histogram entry j in each of the video frames, to the row vector $\phi_j = [u_{j1}\ u_{j2}\ \cdots\ u_{jr}]$ of matrix U.

Fig. 2. Performance evaluation of the SVD-based shot boundary detection using κ as a parameter

The SVD requires that the number of rows m of matrix A be greater than or equal to its number of columns n. If the number of frames exceeds the number of elements in each concatenated histogram, the SVD must be carried out on $A^T$, and consequently the roles of matrices U and V explained above are exchanged. For simplicity, and without loss of generality, only the processing of matrix A is described in the remainder of this paper.

The SVD has the following dimension reduction property, which has been widely utilized in many application areas (see [20] for proof).

Theorem 1. Let the SVD of matrix A be given by Eq. (1), $U = [U_1\ U_2\ \cdots\ U_n]$, $V = [V_1\ V_2\ \cdots\ V_n]$, and rank(A) = r. Matrix $A_\kappa$ (κ ≤ r) defined below is the closest rank-κ matrix to A for the Euclidean and Frobenius norms:

$A_\kappa = \sum_{i=1}^{\kappa} U_i \sigma_i V_i^T$  (3)

Using the κ largest singular values to approximate the original matrix with Eq. (3) has more implications than just dimension reduction. Discarding small singular values is equivalent to discarding linearly semi-dependent, or practically non-essential, axes of the singular vector space. In our case, axes with small singular values usually capture either non-essential color variations or noise within the video sequence. The truncated SVD thus captures the most salient underlying structure in the association of histograms and video frames, while removing noise and trivial variations among video frames. Minor differences between histograms are ignored, and video frames with similar color distribution patterns are mapped near to each other in the κ-dimensional partial singular vector space. Similarity comparison between frames in this partial singular vector space will certainly yield better results than in the raw feature space.
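A small sketch of the rank-κ approximation of Eq. (3); by the Eckart-Young result behind Theorem 1, the Frobenius error equals the energy in the discarded singular values, which the last line checks:

```python
import numpy as np

def truncated_svd(A: np.ndarray, kappa: int) -> np.ndarray:
    """Rank-kappa approximation of Eq. (3):
    A_kappa = sum_{i=1..kappa} U_i * sigma_i * V_i^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :kappa] * s[:kappa]) @ Vt[:kappa, :]

A = np.random.rand(1125, 300)
A50 = truncated_svd(A, 50)
s = np.linalg.svd(A, compute_uv=False)
# Frobenius error of the best rank-50 approximation equals the root of
# the discarded singular values' energy.
assert np.isclose(np.linalg.norm(A - A50, "fro"),
                  np.sqrt(np.sum(s[50:] ** 2)))
```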

To demonstrate the impact of discarding small singular values on the similarity comparison between video frames, we implemented a shot boundary detection method in the singular vector space, and evaluated its performance using κ as a parameter. More precisely, for each frame i, we take the column vector $\psi_i$ of matrix $V^T$ as its feature vector, and use Eq. (4) as the similarity metric to compare it with frame j = i + 1. Once the difference between frames i and j exceeds a predefined threshold, we declare the detection of a shot boundary:

$D(i,j) = \sqrt{\sum_{k=1}^{\kappa} \sigma_k (v_{ik} - v_{jk})^2}$  (4)

In fact, Eq. (4) defines a Euclidean distance weighted by the singular values $\sigma_k$; κ is the parameter that specifies how many singular values are used in the metric.

The shot boundary detection method is evaluated using two common measures, recall and precision, which are defined as follows:

Recall = (No. of correctly detected boundaries) / (No. of true boundaries)

Precision = (No. of correctly detected boundaries) / (No. of totally detected boundaries)
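In code, the two measures are a pair of ratios:

```python
def recall_precision(n_correct: int, n_true: int, n_detected: int):
    """Recall and precision as defined above."""
    return n_correct / n_true, n_correct / n_detected
```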

The evaluation was performed using a set of TV video programs with a total length of three hours. These video programs contain a great variety of scene categories taken from news, documentaries, movies, TV commercials, etc. All the video programs are in MPEG-1 format with a frame size of 352 × 240 pixels. Figure 2 shows the evaluation result with the value of κ as a parameter. For each given κ value, we empirically determined the shot boundary detection threshold so that the best recall and precision were obtained. It can be seen from the figure that when κ takes values below 30, shot boundary detection yields poor recall and precision (below 0.5). When κ equals 50, both the recall and precision reach their maximum. When κ increases further, the recall decreases slightly and then flattens out, whereas the precision decreases by 10% and then stabilizes at that level. Since the raw feature vector of each video frame has 1125 dimensions, the SVD operation can theoretically produce up to 1125 singular values. However, because these 1125 dimensions are not fully independent of each other, the SVD seldom produces 1125 non-zero singular values. Our experiments have shown that the singular values from rank 600 onward are either very small or can be ignored.

Based on the above experiments, we set the value of κ to 50, and use this value together with Eq. (4) as the similarity metric for both the video summarization and retrieval systems. With κ = 50, the best threshold for shot boundary detection equals 8000.

For comparison, we also implemented the same shot boundary detection method using the raw feature vectors. That is, instead of using the column vector $\psi_i$ of matrix $V^T$, we use the column vector $A_i$ of matrix A for each frame i. The Euclidean distance between two raw feature vectors is used as the similarity metric. The performance evaluation on the same test video set yielded a 73% recall and 67% precision. This performance is comparable to that of the singular-vector-based method using the full set of singular values. This comparison is further evidence of the benefit of discarding small singular values in the similarity comparison between video frames.

    6 SVD-based video summarization

Besides the SVD properties described in the previous section, we have derived the following SVD property, which constitutes the basis of our video summarization system (see Appendix 1 for proof).

Theorem 2. Let the SVD of A be given by Eq. (1), $A = [A_1 \cdots A_i \cdots A_n]$, $V^T = [\psi_1 \cdots \psi_i \cdots \psi_n]$. Define the norm of $\psi_i = [v_{i1}\ v_{i2}\ \cdots\ v_{in}]^T$ in the singular vector space as:

$\|\psi_i\| = \sqrt{\sum_{j=1}^{\mathrm{rank}(A)} v_{ij}^2}$  (5)

If rank(A) = n, then, from the orthonormal property of matrix V, we have $\|\psi_i\|^2 = 1$ for i = 1, 2, ..., n. Let $\bar{A} = [A_1 \cdots A_i^{(1)} \cdots A_i^{(k)} \cdots A_n]$ be the matrix obtained by duplicating column vector $A_i$ in A k times ($A_i^{(1)} = \cdots = A_i^{(k)} = A_i$), and let $\bar{V}^T = [\bar\psi_1 \cdots \bar\psi_j \cdots \bar\psi_{n+k-1}]$ be the corresponding right singular vector matrix obtained from the SVD. Then $\|\bar\psi_j\|^2 = 1/k$ for each of the k duplicated columns, j = 1, 2, ..., k.

The above theorem indicates that, if a column vector $A_i$ of matrix A is linearly independent, the SVD will project it to a vector $\psi_i$ whose norm, as defined by Eq. (5), is one in the singular vector space. When $A_i$ has duplicates $A_i^{(j)}$, the norm of its projected vector $\bar\psi_j$ decreases: the more duplicates $A_i$ has, the smaller the norm of $\bar\psi_j$. Translating this property into the video domain, it can be inferred that, in the singular vector space, frames in a static video segment (e.g., shots of anchor persons or weather maps) will be projected to points closer to the origin, while frames in a video segment containing many changes (e.g., shots containing moving objects, camera pan and zoom) will be projected to points farther from the origin. In other words, by looking at the location at which a video segment is projected, we can roughly tell the degree of visual change of the video segment.
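Theorem 2 is easy to verify numerically. The following sketch duplicates one column of a random matrix k times and checks that the squared norms of Eq. (5) drop to 1/k for the duplicated columns, while independent columns keep norm one:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((200, 10))                  # 10 linearly independent columns
k = 4
# Duplicate column 3 so that it appears k times in total.
A_dup = np.hstack([A, np.tile(A[:, [3]], (1, k - 1))])

_, s, Vt = np.linalg.svd(A_dup, full_matrices=False)
r = int(np.sum(s > s[0] * 1e-10))          # numerical rank (= 10 here)
norms2 = np.sum(Vt[:r] ** 2, axis=0)       # squared norms of psi_i, Eq. (5)

assert np.allclose(norms2[3], 1 / k)       # duplicated column: 1/k
assert np.allclose(norms2[-(k - 1):], 1 / k)
assert np.allclose(norms2[0], 1.0)         # independent column: 1
```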

Consider a video and ignore its audio signal. From the viewpoint of visual content, a static video with little visual change contains less visual content than a dynamic video with many changes. In other words, the degree of visual change in a video segment is a good indicator of the amount of visual content conveyed by the segment. Since the degree of visual change of a given video segment has a strong correlation with the location of its corresponding cluster $S_i$ in the singular vector space, we define the following quantity as a metric of the visual content contained in cluster (video segment) $S_i$:

$\mathrm{CON}(S_i) = \sum_{\psi_j \in S_i} \|\psi_j\|^2$  (6)

Using this visual content metric, we strive to group video frames into units with approximately equal amounts of visual content in the singular vector space. More precisely, in the singular vector space, we first find the most static frame cluster and define it as the visual content unit, and then use the value of the visual content metric computed from it as the threshold for clustering the rest of the frames. The main operations of the video summarization system are as follows (Fig. 1).


Main process

Step 1. Coarsely sample the input video sequence at a fixed rate.

Step 2. Create the feature-frame matrix A using the frames in the sampling set (see Sect. 4).

Step 3. Perform the SVD on A to obtain matrix $V^T$, in which each column vector $\psi_i$ represents frame i in the singular vector space.

Step 4. In the singular vector space, find the most static cluster, compute the value of its visual content metric using Eq. (6), and cluster the rest of the frames into units that have approximately the same amount of visual content as the most static cluster.

Step 5. For each cluster $S_i$ obtained, find the longest shot $\theta_i$ contained in the cluster. Discard any cluster whose $\theta_i$ is shorter than one second. Output a summarized motion video with the user-specified time length.

In Step 4 of the above process, finding the most static cluster is equivalent to finding the cluster closest to the origin of the singular vector space. Using the notation of Theorems 1 and 2, the entire clustering process in Step 4 can be described as follows.

Clustering

1. In the singular vector space, sort all the vectors $\psi_i$ in ascending order of their norms as defined by Eq. (5). Initialize all the vectors as unclustered, and set the cluster counter C = 1.

2. Among the unclustered vectors, select the one with the smallest norm as the seed to form cluster $S_C$. Set the average internal distance of the cluster $R(S_C) = 0$ and the frame count $P_C = 1$.

3. For each unclustered vector $\psi_i$, calculate its minimum distance to cluster $S_C$, defined as:

$d_{\min}(\psi_i, S_C) = \min_{\psi_k \in S_C} D(i, k)$  (7)

where D(i, k) is defined by Eq. (4). If the cluster counter C = 1, go to Case (a); otherwise, go to Case (b).

(a) Add frame i to cluster $S_1$ if

$R(S_1) = 0$, or
$d_{\min}(\psi_i, S_1)/R(S_1) < 5.0$

(b) Add frame i to cluster $S_C$ if

$R(S_C) = 0$, or
$\mathrm{CON}(S_C) < \mathrm{CON}(S_1)$, or
$d_{\min}(\psi_i, S_C)/R(S_C) < 2.0$

If frame i is added to cluster $S_C$, increment the frame count $P_C$ by one, update the content value $\mathrm{CON}(S_C)$ using Eq. (6), and update $R(S_C)$ as follows:

$R(S_C) = \frac{(P_C - 1)\, R(S_C) + d_{\min}(\psi_i, S_C)}{P_C}$  (8)

4. If unclustered points remain, increment the cluster counter C by one and go to Step 2; otherwise, terminate.

In the above procedure, notice that different conditions are used for growing the first cluster and the remaining clusters. The first cluster relies on the distance variation $d_{\min}(\psi_i, S_1)/R(S_1)$ as its growing condition, while the remaining clusters examine the visual content measure as well as the distance variation during growing (see the sketch below). Condition 2 in Case (b) ensures that the cluster under processing contains the same amount of visual content as the first cluster, while Condition 3 prevents two frames that are very close to each other from being separated. With Condition 2, a long video shot with large visual variations will be clustered into more than one cluster, and consequently will be assigned more than one slot in the summary. On the other hand, with the combination of Conditions 2 and 3, video shots with very similar visual content will be clustered together, and only one slot will be assigned to that group of video shots. These characteristics exactly meet the goals set for our video summarization system.
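The following Python sketch is one plausible reading of Steps 1-4; the text does not fully pin down the iteration order over unclustered vectors, so this is an illustration rather than the authors' exact implementation:

```python
import numpy as np

def cluster_frames(Vt, s, kappa=50):
    """Grow clusters of roughly equal visual content (Sect. 6).
    Vt[:, i] is psi_i; s holds the singular values."""
    r = int(np.sum(s > s[0] * 1e-10))           # numerical rank
    norms2 = np.sum(Vt[:r] ** 2, axis=0)        # squared norms, Eq. (5)

    def dist(i, j):                             # Eq. (4)
        d = Vt[:kappa, i] - Vt[:kappa, j]
        return np.sqrt(np.sum(s[:kappa] * d ** 2))

    order = list(np.argsort(norms2))            # Step 1: ascending norm
    unclustered = set(order)
    clusters, con1 = [], None
    while unclustered:
        seed = next(i for i in order if i in unclustered)   # Step 2
        S, R, con = [seed], 0.0, norms2[seed]   # CON(S) per Eq. (6)
        unclustered.remove(seed)
        first = con1 is None
        for i in list(unclustered):             # Step 3
            d_min = min(dist(i, m) for m in S)  # Eq. (7)
            if first:
                grow = R == 0 or d_min / R < 5.0                # Case (a)
            else:
                grow = R == 0 or con < con1 or d_min / R < 2.0  # Case (b)
            if grow:
                S.append(i)
                unclustered.remove(i)
                con += norms2[i]                           # update Eq. (6)
                R = ((len(S) - 1) * R + d_min) / len(S)    # Eq. (8)
        if first:
            con1 = con              # content of the most static cluster
        clusters.append(S)          # Step 4: loop until all are clustered
    return clusters
```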

In the main process, Step 5 forms another unique characteristic of our video summarization system: it is able to output a summarized motion video of the original video sequence with a user-specified time length and granularity. The system composes a summarized video according to two user inputs: the time length of the summarized video, $T_{len}$, and the minimum time length (the granularity) for which each shot should be displayed in the summarized video, $T_{min}$. The process consists of the following main operations:

Summary composition

1. Let C be the number of clusters obtained from the above clustering process, and $N = T_{len}/T_{min}$. For each cluster $S_i$, find the longest video shot $\theta_i$.

2. If C ≤ N, go to Case (i); otherwise, go to Case (ii).

(i) Select all the shots $\theta_i$, i = 1, 2, ..., C, and assign an equal time length $L = T_{len}/C$ to each of the shots.

(ii) Sort the shots $\theta_i$ in descending order of length, select the top N shots, and assign an equal time length $L = T_{min}$ to each selected shot.

3. From each selected shot, take the first L seconds, and concatenate these video segments in their original time order to form the motion video summary.

Given the user's inputs $T_{len}$ and $T_{min}$, the maximum number of shots the summarized video can include equals $N = T_{len}/T_{min}$. If the total number of shots C ≤ N, then all the shots are assigned a slot in the summarized video (Case (i)); otherwise, shots are selected in descending order of length to fill the summarized video. Here, the parameter $T_{min}$ can be considered a control knob with which the user selects between depth-centric and breadth-centric summarization (a sketch follows). A small value of $T_{min}$ produces a breadth-centric video summary (a summary with small granularity), which consists of more but shorter shots, while a large value of $T_{min}$ produces a depth-centric video summary (a summary with large granularity), which consists of fewer but longer shots.
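A sketch of the composition logic; the `cluster_shots` structure, which associates each cluster with the (start, end) intervals of the shots it contains, is hypothetical, since mapping clusters back to shots depends on the shot boundary data:

```python
def compose_summary(cluster_shots, T_len, T_min):
    """Summary composition (Sect. 6). `cluster_shots` holds, per cluster,
    the (start_sec, end_sec) intervals of the shots in that cluster."""
    # Step 5 of the main process: keep the longest shot per cluster,
    # discarding clusters whose longest shot is under one second.
    longest = [max(shots, key=lambda t: t[1] - t[0])
               for shots in cluster_shots if shots]
    longest = [t for t in longest if t[1] - t[0] >= 1.0]
    if not longest:
        return []

    N = int(T_len / T_min)                 # maximum number of slots
    if len(longest) <= N:                  # Case (i): a slot for every shot
        selected, L = longest, T_len / len(longest)
    else:                                  # Case (ii): top N longest shots
        selected = sorted(longest, key=lambda t: t[0] - t[1])[:N]
        L = T_min
    # Take the first L seconds of each selected shot, in original time order.
    return sorted((start, start + L) for start, _ in selected)
```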


Table 1. Evaluation for video summarization

Time length   Total shots   Similar shots properly merged   Dynamic shots assigned more slots
120 min       683           79%                             86%

    8 Evaluation of video summarization

Conducting an objective and meaningful evaluation of a video summarization method is particularly difficult and challenging, and is an open issue deserving more research. The challenges stem mainly from the fact that research on video summarization is still at an early stage, and there are no agreed-upon metrics for performance evaluation. These challenges are further compounded by the fact that different people have different opinions of and requirements for video summaries, making the creation of any agreed-upon performance metrics even more difficult.

With the above observations in mind, we choose to evaluate our video summarization system using the objectives set at the beginning of Sect. 3, which are: (1) adjustable summary length and granularity; (2) redundancy reduction; and (3) equal attention to equal amounts of visual content. The realization of the first objective is obvious from the algorithm itself. Redundancy reduction can be measured by showing how many visually similar or duplicated shots have been merged, and how many static shots with little visual change have been shortened. For the objective of equal attention to equal amounts of visual content, we show the percentage of long, dynamic shots that have been assigned more play time in the summaries produced. We used a portion of the video data set described in Sect. 5, excluding TV commercials and many short video programs that are not very meaningful for summarization. The test data set is two hours in total length, and consists of news reports, documentaries, political debates, talk shows, and live coverage of breaking events.

Fig. 3. A video summarization result

Figure 3 details a one-minute summary of a six-minute news report covering the Clinton–Lewinsky scandal. The sequence consists of 29 shots, of which Fig. 3 displays the 15 major ones. Each row in the left-hand rectangle represents a shot in the original video, and the number of frames in each row is proportional to the time length of the corresponding shot. The same row in the right-hand rectangle depicts the number of slots assigned to the corresponding shot. Each slot has an equal play time, calculated using the algorithm described in Sect. 6. In our experiment, the thirteenth shot (represented by row 13) was detected as the most static shot, and was used as the visual content unit for clustering the rest of the shots. The anchor person appears twice, once at the beginning (row 1) and once at the end (row 15) of the whole sequence. However, as the two shots are quite static and visually similar, they were clustered together and assigned only one slot in the summary (row 1 in the right-hand rectangle). A similar situation occurs for shots 2 and 14. Shot 12 is the longest shot in the whole sequence and contains many changes. It was clustered into three clusters together with shot 10, and assigned three slots. Similarly, as shot 5 contains many visual changes, it was assigned two slots. Figure 3 demonstrates that this one-minute motion video summary largely meets the objectives we set for video summaries.

Table 2. Evaluation for video retrieval

                       Recall   Precision
Similarity retrieval   76%      72%
Color uniformity       91%      94%
Dynamic degree         94%      95%

Table 1 shows the overall evaluation results on the two-hour test video set. The results show that our video summarization method has an 86% accuracy in segmenting long, dynamic shots into multiple clusters, hence giving more attention to this type of shot. For merging visually similar or duplicated shots, our method shows a 79% accuracy. The failure to merge some visually similar shots occurs mainly when two shots have similar color distributions but different illumination conditions, or position shifts of the main objects (e.g., anchor persons, buildings).

    9 Evaluation of video retrieval

The video retrieval system is evaluated using the same three-hour video set used for testing the shot boundary detection method. Each video program in the test set was segmented into individual shots; this shot segmentation process yielded 1100 camera shots, and hence produced 1100 keyframes.

The video retrieval system supports three basic types of video search: search based on visual similarity, on color distribution uniformity, and on the degree of visual change. Any combination of the three basic search types is also supported. For similarity-based video search, the user must present a sample frame of the desired video shot. The system matches the sample frame against all the keyframes stored in the database, and retrieves the 16 shots whose keyframes are the top 16 matches. Searching for shots based on color distribution uniformity and degree of visual change, on the other hand, is realized mainly through predefined thresholds. With the color distribution uniformity metric defined by Eq. (10), keyframes with a value above 19,000 are composed of very few colors, their histograms containing many empty bins; in contrast, keyframes with a value below 4000 are composed of many colors, and relatively uniform color distributions are observed in their histograms. Likewise, using the visual change metric defined by Eq. (5), shots with a value above 2000 can be categorized as dynamic shots containing a high degree of visual change, while shots with a value below 1000 can be categorized as static shots with very little change.

Fig. 4a–c. Video retrieval examples

Figure 4 shows examples of video retrieval based on visual similarity, color distribution uniformity, and degree of visual change. Figures 4(a), (b), and (c) display the top 16 returns from the database using the respective retrieval methods. In Fig. 4(a), a frame from a scene of the Clinton testimony before Kenneth Starr is used as the sample frame (displayed in the top large window), and the system returns the top 16 matches, 12 of which belong to other shots of the Clinton testimony. The remaining four images are similar either in overall color layout and distribution (thirteenth and fourteenth images) or in the main object (fifteenth and sixteenth images). In Fig. 4(b), retrieval of shots with skewed color distributions is conducted, and the 16 keyframes shown are either black-and-white images or composed of very few colors. In Fig. 4(c), the results of retrieving dynamic shots are displayed; they represent shots containing dissolves, fast-moving objects, camera panning, etc. For the second and third types of video retrieval, if no other constraints are imposed, the system returns all the keyframes that satisfy the predefined threshold. If more than 16 keyframes are returned from the database, the arrow buttons at the bottom of the window can be used to move to the previous or next page of the returned keyframes.
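Reduced to code, this threshold-based categorization is a handful of comparisons; the metric values themselves (the Eq. (10) uniformity value from Sect. 7 and the Eq. (5)-based visual change value) are assumed to be precomputed per shot:

```python
def categorize(uniformity: float, dynamics: float) -> list[str]:
    """Apply the thresholds quoted above to precomputed metric values."""
    tags = []
    if uniformity > 19_000:
        tags.append("skewed color distribution (very few colors)")
    elif uniformity < 4_000:
        tags.append("uniform color distribution (many colors)")
    if dynamics > 2_000:
        tags.append("dynamic shot")
    elif dynamics < 1_000:
        tags.append("static shot")
    return tags
```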

The video retrieval system is evaluated using recall and precision, whose definitions are analogous to those given in Sect. 5. Table 2 shows the experimental results on the three-hour test video set. It is clear from the table that, while similarity-based video retrieval obtains reasonable performance, video retrieval based on color uniformity and degree of visual change achieves impressive recall and precision. This result demonstrates the effectiveness of the SVD-derived metrics described in Sect. 7.

    10 Summary and discussion

In this paper, we have proposed a novel framework that realizes video summarization and retrieval by sharing the same features and the same singular vector space. Through mathematical analysis, we have derived the SVD properties that capture both the temporal and spatial characteristics of the input video sequence in the singular vector space. Using these SVD properties, we are able to categorize and retrieve video shots according to their degrees of visual change, color distribution uniformity, and visual similarity. These SVD properties also enable us to derive a metric of the visual content value of a given video segment. Using this metric, we are able to group video frames into clusters with approximately the same visual content value. With this clustering technique, we strive to generate video summaries that: (1) have an adjustable length and granularity controllable by the user; (2) contain little redundancy; and (3) give equal attention to equal amounts of visual content.

In the experimental evaluations, we used the above objectives to measure the effectiveness of the video summarization system, and used the common recall and precision metrics to evaluate the performance of the video retrieval system. The evaluation results show that the video summarization system has met our objectives, and that the video retrieval system is effective at retrieving video shots based on visual similarity, color distribution uniformity, and degree of visual change.

The SVD and certain SVD properties (e.g., Theorems 1 and 2) can be applied to other image features, as long as the features have a fixed number of dimensions. The advantages of using the SVD become more obvious with high-dimensional features (e.g., more than 100 dimensions), because the SVD can tell which dimensions are important and which are not. As shown in Sect. 5 and in [21], in singular vector spaces discarding unimportant dimensions is more than just dimension reduction; in many cases, it improves system performance.

Our future work includes improving the summary composition process so that the generated motion video summaries have better audio effects, and testing singular value decomposition in different color spaces for comparison with the RGB color space. We are also going to apply the SVD to other image features, such as edges and textures, and conduct performance comparisons with the histogram+SVD combination.

    Appendix 1: Proof of Theorem 2

Assume that, except for the k duplicates of $A_i$, the column vectors of $\bar{A}$ are all linearly independent. By performing the SVD on $\bar{A}$ and, without loss of generality, conducting some permutations, we obtain the matrix $\bar{V}^T$ shown in Fig. 5. In the figure, the row vectors from 1 to n are the right singular vectors whose corresponding singular values are non-zero, and each column vector $\bar{y}_j = [y_{j1}\ y_{j2}\ \cdots\ y_{j(n+k-1)}]^T$, j = 1, ..., k, in the hatched rectangle area corresponds to $A_i^{(j)}$ in $\bar{A}$. Because the SVD projects identical column vectors of $\bar{A}$ to the same point in the refined feature space, the following condition holds:

$y_{1s} = y_{2s} = \cdots = y_{ks}$, where $1 \le s \le n$  (11)

Because $\bar{V}^T$ is an orthonormal matrix,

$\bar{a} \cdot \bar{b} = \delta_{ab}$  (12)

$\bar{c} \cdot \bar{d} = 0$  (13)

where $\bar{a}$, $\bar{b}$, and $\bar{d}$ represent any column vectors from the hatched rectangle area, and $\bar{c}$ represents any column vector from the dotted rectangle area. From Eq. (11) and Eq. (12), the condition in Eq. (11) does not hold for $n < s \le n+k-1$. From Eq. (11) and Eq. (13), the elements $n+1$ to $n+k-1$ of each vector $\bar{c}$ all equal zero.


Fig. 5. The structure of matrix $\bar{V}^T$ (of size $(n+k-1) \times (n+k-1)$) after some permutations

From the orthonormal property of $\bar{V}^T$,

$\sum_{i=1}^{k} \sum_{s=n+1}^{n+k-1} y_{is}^2 = k - 1$  (14)

$\sum_{i=1}^{k} \sum_{s=1}^{n+k-1} y_{is}^2 = k$  (15)

Subtracting Eq. (14) from Eq. (15), we have

$\sum_{i=1}^{k} \sum_{s=1}^{n} y_{is}^2 = 1$  (16)

From Eq. (11) and Eq. (16), we have

$\|\bar\psi_j\|^2 = 1/k$, where $1 \le j \le k$  (17)

    Appendix 2: Proof of Theorem 3

Assume that the SVD of A is given by Eq. (1). Let $A = [A_1 \cdots A_i \cdots A_n]$, $V^T = [\psi_1 \cdots \psi_i \cdots \psi_n]$, $A_i = [a_{1i}\ a_{2i}\ \cdots\ a_{mi}]^T$, and $\psi_i = [v_{i1}\ v_{i2}\ \cdots\ v_{in}]^T$. Then we have

$A^T A = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T$  (18)

The last step makes use of the orthonormal property of matrix U. In the above equation, diagonal element i of the left-hand-side matrix $A^T A$ equals $A_i \cdot A_i = \sum_{j=1}^{m} a_{ji}^2$, and diagonal element i of the right-hand-side matrix $V \Sigma^2 V^T$ equals $\sum_{j=1}^{n} \sigma_j^2 v_{ij}^2$. Therefore, we have $\sum_{j=1}^{n} \sigma_j^2 v_{ij}^2 = A_i \cdot A_i = \sum_{j=1}^{m} a_{ji}^2$.
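The diagonal identity is easy to confirm numerically:

```python
import numpy as np

A = np.random.rand(50, 20)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Diagonal entry i of A^T A (Eq. 18) equals both the squared norm of the
# raw feature vector A_i and the sigma-weighted sum over psi_i.
lhs = np.sum(A ** 2, axis=0)                       # A_i . A_i
rhs = np.sum((s[:, None] ** 2) * Vt ** 2, axis=0)  # sum_j sigma_j^2 v_ij^2
assert np.allclose(lhs, rhs)
```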

References

1. Chang S-F, Eleftheriadis A, McClintock R (1998) Next-generation content representation, creation and searching for new media applications in education. Proc IEEE (special issue on multimedia signal processing) 86:884–904
2. Tonomura Y, Akutsu A, Otsuji K, Sadakata T (1993) VideoMAP and VideoSpaceIcon: tools for anatomizing video content. Proceedings ACM INTERCHI '93, Amsterdam, The Netherlands
3. Ueda H, Miyatake T, Yoshizawa S (1991) IMPACT: an interactive natural-motion-picture dedicated multimedia authoring system. Proceedings ACM SIGCHI '91, New Orleans, LA
4. Ferman A, Tekalp A (1997) Multiscale content extraction and representation for video indexing. Proc SPIE 3229:23–31
5. DeMenthon D, Kobla V, Doermann D (1998) Video summarization by curve simplification. Technical Report LAMP-TR-018, Language and Media Processing Laboratory, University of Maryland, MD
6. Yeung M, Yeo B, Wolf W, Liu B (1995) Video browsing using clustering and scene transitions on compressed sequences. Proc SPIE 2417:399–413
7. Girgensohn A, Boreczky J (1999) Time-constrained keyframe selection technique. Proceedings IEEE Multimedia Computing and Systems (ICMCS '99), Florence, Italy
8. Farin D, Effelsberg W, de With PH (2002) Robust clustering-based video summarization with integration of domain knowledge. Proceedings IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland
9. Irani M, Anandan P (1998) Video indexing based on mosaic representations. Proc IEEE 86(5):905–921
10. Gelgon M, Bouthemy P (1998) Determining a structured spatio-temporal representation of video content for efficient visualization and indexing. Proceedings European Conference on Computer Vision (ECCV), Freiburg, Germany
11. Aner A, Tang L, Kender JR (2002) A method and browser for cross-referenced video summaries. Proceedings IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland
12. Ponceleon D, Amir A, Srinivasan S, Mahmood T, Petkovic D (1999) CueVideo: automated multimedia indexing and retrieval. Proceedings ACM Multimedia, Orlando, FL
13. Smith MA, Kanade T (1997) Video skimming and characterization through the combination of image and language understanding techniques. Proceedings CVPR '97, Puerto Rico, pp 775–781
14. Gong Y, Sin LT, Chuan CH, Zhang H, Sakauchi M (1995) Automatic parsing of TV soccer programs. IEEE International Conference on Multimedia Computing and Systems, Boston, MA, pp 167–174
15. Xu P, Xie L, Chang SF, Divakaran A, Vetro A, Sun H (2001) Algorithms and system for segmentation and structure analysis in soccer video. IEEE Conference on Multimedia and Expo, Tokyo, Japan, pp 928–931
16. Zhong D, Chang SF (2001) Structure analysis of sports video using domain models. IEEE Conference on Multimedia and Expo, Tokyo, Japan, pp 920–923
17. Rui Y, Gupta A, Acero A (2000) Automatically extracting highlights for TV baseball programs. Eighth ACM International Conference on Multimedia, Los Angeles, CA, pp 105–115
18. Swain MJ (1993) Interactive indexing into image databases. Proc SPIE 1908:95–103
19. Press W et al. (1992) Numerical recipes in C: the art of scientific computing, 2nd edn. Cambridge University Press, Cambridge
20. Golub G, Van Loan C (1989) Matrix computations, 2nd edn. Johns Hopkins University Press, Baltimore
21. Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41:391–407