

106 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 5, NO. 1, MARCH 2003

Statistical Sequential Analysis for Real-Time Video Scene Change Detection on Compressed Multimedia Bitstream

Dan Lelescu and Dan Schonfeld

Abstract—The increased availability and usage of multimedia information have created a critical need for efficient multimedia processing algorithms. These algorithms must offer capabilities related to browsing, indexing, and retrieval of relevant data. A crucial step in multimedia processing is that of reliable video segmentation into visually coherent video shots through scene change detection. Video segmentation enables subsequent processing operations on video shots, such as video indexing, semantic representation, or tracking of selected video information. Since video sequences generally contain both abrupt and gradual scene changes, video segmentation algorithms must be able to detect a large variety of changes. While existing algorithms perform relatively well for detecting abrupt transitions (video cuts), reliable detection of gradual changes is much more difficult. In this paper, a novel one-pass, real-time approach to video scene change detection based on statistical sequential analysis and operating on compressed multimedia bitstreams is proposed. Our approach models video sequences as stochastic processes, with scene changes being reflected by changes in the characteristics (parameters) of the process. Statistical sequential analysis is used to provide a unified framework for the detection of both abrupt and gradual scene changes.

Index Terms—Abrupt and gradual scene changes, scene change detection, video segmentation, video shots.

I. INTRODUCTION

GIVEN the rapid expansion in the volume of multimedia data and the increasing demand for fast access to relevant data, the need for parsing, interpretation, and retrieval of multimedia information has become of paramount importance. Algorithms designed for multimedia processing must offer capabilities such as browsing, indexing, tracking, and retrieval of relevant information. Since the original data are compressed to preserve storage space, and decompression is extremely costly, the algorithms designed to operate with multimedia data should ideally do so in the compressed (or minimally-decoded) domain. An important step in processing multimedia information is the parsing of video into relatively visually-coherent segments

Manuscript received March 28, 2000; revised December 18, 2001. This work was performed while the first author was with the University of Illinois-Chicago. The associate editor coordinating the review of this paper and approving it for publication was Dr. Tat-Seng Chua.

D. Lelescu is with DoCoMo Communications Laboratories, San Jose, CA 95110 USA (e-mail: [email protected]).

D. Schonfeld is with the Multimedia Communications Laboratory, Department of Electrical and Computer Engineering, University of Illinois-Chicago, Chicago, IL 60607-7053 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TMM.2003.808819

called video shots, delimited by scene changes. Video segmentation is needed to enable subsequent processing operations on video shots, such as video indexing, semantic representation, or tracking of selected video information. After segmentation into shots, video indexing can be performed to facilitate a compact representation of the information in each video shot, to be used for later retrieval. Algorithms can also perform video analysis or retrieval by tracking specific objects in video shots, as presented, for example, in [1]. In this paper, we propose a new real-time approach for detecting scene changes based on statistical sequential analysis, operating on minimally-decoded video bitstreams. The algorithms developed were tested on videos encoded using the MPEG-2 video compression standard. However, they can easily be extended for use with different video compression methods.

In general, video sequences contain two main types of scene changes: abrupt and gradual. With the increasing use of special effects as a method of transition from one video shot to another, it should be assumed that video sequences contain both abrupt and different types of gradual changes. Thus, algorithms for scene change detection should not be designed only for detection of abrupt transitions, or of specific gradual changes such as fades or dissolves. A challenge in video segmentation is that the detection algorithm must be able to handle all types of scene changes. Whereas abrupt changes (cuts) are relatively easy to detect, gradual changes present more difficulty for detection. There is a large variety of gradual transitions that can be produced. A gradual scene change usually takes place over a variable number of frames (having different rates of change). Generally, the video change from frame to frame is small. Thus, it is difficult to determine directly in the compressed domain that a gradual change is taking place without considering a sufficiently large number of pictures. It is also desirable to formalize the analysis of video scene changes beyond the use of empirical observations about the nature of the transitions, which may or may not hold in all cases.

In order to address these challenges, we pose the scene change detection problem in the context of statistical sequential representation and analysis of the video information. The nature and volume of video data, and the characteristics of video scene changes, are well-suited to stochastic modeling. Thus, a video sequence is considered to be a realization of a stochastic process. Scene changes are considered to be represented by changes in the characteristics (parameters) of the stochastic process. We use statistical analysis to monitor parameters of the stochastic process, to allow the design of a

1520-9210/03$17.00 © 2003 IEEE

LELESCU AND SCHONFELD: STATISTICAL SEQUENTIAL ANALYSIS 107

unified approach to scene change detection for various types of changes (both abrupt and gradual). Most special effects have particular characteristics that make them hard to detect with deterministic methods, simple statistics (e.g., intensity variances, histograms), or consideration of only a small number of pictures at a time, which has typically been the case in existing algorithms. In our approach, through the use of dimensionality reduction and sufficient statistics, a large enough number of pictures can be analyzed for gradual scene change detection. The dimensionality reduction is achieved by utilizing an efficient implementation of an optimal transformation (principal component analysis). The sufficient statistic used is based on the concept of the generalized likelihood ratio. Thus, the main steps used for detecting video scene changes are the dimensionality reduction of the original video data followed by change detection on the resulting feature vectors.

The use of statistical sequential analysis facilitates a unified approach for detecting both abrupt and gradual transitions, since any scene change is modeled as a change in the parameters of the stochastic process. Thus, the dynamics of the video data dictate when a scene change is detected. The evidence of a scene change, as reflected by changes in a sequentially computed test statistic, is used for detection. No a priori assumptions must be made about the specific model of the scene change (e.g., linear). No predefined time window needs to be used for analyzing the data to detect gradual scene changes, since, theoretically, all the samples from the beginning of the current video shot can be represented in the test statistic. Thus, variable-length gradual changes can be detected, by allowing the statistical data pertaining to a scene change to accumulate and trigger a detection. The sequential aspect of the approach enables the one-pass detection of scene changes.

The resulting algorithms use a single threshold applied to the test statistic that responds to scene changes, for the purpose of isolating the transitions. Therefore, the disadvantage of using difficult-to-adjust combinations of multiple thresholds, found in some existing scene change detection approaches, is eliminated. The novel scene change detection approach presented in this paper offers a one-pass, real-time solution to detecting both abrupt and gradual changes by operating on minimally-decoded video bitstreams. To illustrate this approach, additive and nonadditive models of video scene changes and the performance of the corresponding detection algorithms are discussed in the context of different parameterizations.

This paper is organized as follows. Section II covers related work in the area of video segmentation. In Section III, the dimensionality reduction and representation of the original video data are discussed. The statistical scene change detection approach is discussed in Section IV. Section V contains simulation results. The paper is concluded in Section VI.

II. RELATED WORK

Algorithms for scene change detection can be broadly classified as operating on decompressed data, or working directly with the compressed information, using minimally-decoded data such as macroblock (MB) types, motion vectors, or DC-coefficient images. The original video information is

compressed to preserve storage space, and decompression of the data, in particular the inverse transform operation (inverse Discrete Cosine Transform), is very costly. A number of video segmentation approaches that operate on the decompressed data, in the pixel domain, have appeared in the literature [2]–[6]. They may use methods based on color histogram differences [2], changes in the edge characteristics in the image [3], characteristic patterns in the standard deviation of pixel intensities for detection of fades [5], or linear modeling of transitions in gradual changes [4]. The color histogram-based methods are the most reliable for detection of abrupt changes. In the twin comparison algorithm in [2], it is observed that during a dissolve, successive picture histogram differences are greater than the average within-shot histogram difference, but each of them is not large enough to be classified as a change by itself. Thus, two thresholds for change detection can be used, one lower and one higher. The higher threshold is intended to detect the abrupt transitions. To detect dissolves, the accumulated difference over a range of pictures, with successive histogram differences greater than the lower threshold but not large enough to individually exceed the higher threshold, should exceed the higher threshold. A good comparative assessment of some of these approaches can be found in [6] and [7]. The comparison in [6] takes into account newer algorithms designed to detect more complex edits (fades and dissolves). The author finds that the performance of change detectors based on the edge change ratio (ECR) [3] was inferior to that of specialized detectors based on color histogram differences [8], the standard deviation of pixel intensities, and edge-based contrast. Many dissolves do not show the characteristic edge pattern described in [3], especially long dissolves, where the ECR change is so slight that it is hidden in noise.
An edge-based method for cut detection is also presented in [9]. In addition to operating on decompressed data, a common feature of these approaches is that they work relatively well on abrupt changes or very specific transitions such as fades, but fail on the detection of general gradual changes. Also, analysis of a limited number of pictures at a given time is detrimental to an algorithm's capacity to handle the diverse types of gradual changes. This is true for algorithms that work in both the decompressed and compressed domains.
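The two-threshold twin-comparison rule described above can be sketched as follows. The histogram-difference sequence `d` and the threshold values are hypothetical inputs, and the reset logic is one plausible reading of [2], not the authors' exact procedure.

```python
# Sketch of the twin-comparison rule of [2] on a precomputed sequence `d`
# of successive histogram differences (hypothetical data and thresholds).
# t_high alone flags cuts; accumulated differences above t_low flag dissolves.

def twin_comparison(d, t_low, t_high):
    """Return (cuts, gradual_spans): cut frame indices and (start, end) pairs."""
    cuts, spans = [], []
    acc, start = 0.0, None
    for i, diff in enumerate(d):
        if diff >= t_high:                 # abrupt change: one large difference
            cuts.append(i)
            acc, start = 0.0, None
        elif diff >= t_low:                # candidate gradual transition
            if start is None:
                start = i
            acc += diff
            if acc >= t_high:              # accumulated evidence exceeds t_high
                spans.append((start, i))
                acc, start = 0.0, None
        else:                              # difference fell back below t_low
            acc, start = 0.0, None
    return cuts, spans
```

For example, a single spike is reported as a cut, while a run of moderate differences whose sum crosses the higher threshold is reported as a gradual span.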

In an approach that can detect both abrupt and gradual scene changes [10], the authors observe that a video shot boundary can be construed as a temporal multiresolution edge event in a feature space. The feature space can be obtained by considering color, shape, texture, etc. A color histogram is adopted to represent each video frame. A wavelet technique is applied to the video frames in this representation space, and a multiple-resolution analysis is used to identify both abrupt and gradual scene changes. Different transitions with different durations are visible at different resolutions. Viewing the wavelet transform as a convolution operation, the length of the wavelet filter varies between two and 100 frames. The authors report a Recall of 98.5% and Precision of 77.9% for video cuts, and a Recall of 97.1% and Precision of 65.2% for gradual transitions.
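The Recall and Precision figures quoted throughout this section follow the standard definitions, sketched here for reference (the counts in the usage example are illustrative, not from any cited work):

```python
def recall_precision(true_pos, false_neg, false_pos):
    """Recall = TP / (TP + FN); Precision = TP / (TP + FP)."""
    recall = true_pos / (true_pos + false_neg)
    precision = true_pos / (true_pos + false_pos)
    return recall, precision
```

A detector that finds 9 of 10 true transitions while raising 3 false alarms has Recall 0.9 and Precision 0.75.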

Recently, the emphasis in multimedia processing has been placed on algorithms operating in the compressed domain [11]–[16]. Our proposed approach falls into this category.


Scene change detection algorithms operating on compressed data may use MB and motion type information for detecting transitions [11], [12], [16]. In the case of gradual changes or special effects, this information may not be sufficient for detecting the change (especially if only a few pictures are analyzed at a given time). For example, for abrupt cuts occurring in P pictures, it is expected that macroblocks will be mostly intracoded, since they cannot be predicted well from the different reference picture information. Similarly, for a cut appearing in a B picture, it is assumed that macroblocks will be either intracoded or backward-predicted (to take advantage of a future reference picture that contains similar information). Depending on the difference between the visual information in the old and new video shots, these conditions may have various degrees of reliability. In the case of gradual changes, the problem is compounded by the fact that MBs can remain bidirectionally-predicted in B pictures, or predicted (not intracoded) in P pictures.
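The macroblock-type heuristics above can be expressed as simple rules on MB counts. The function names, count arguments, and the 0.7 ratio are illustrative assumptions, not values from the cited works:

```python
def p_picture_cut_suspected(n_intra, n_total, ratio=0.7):
    """Cut in a P picture: most macroblocks intracoded (prediction fails)."""
    return n_intra / n_total >= ratio

def b_picture_cut_suspected(n_intra, n_backward, n_total, ratio=0.7):
    """Cut in a B picture: macroblocks intracoded or backward-predicted."""
    return (n_intra + n_backward) / n_total >= ratio
```

As the paragraph notes, such rules are unreliable for gradual changes, where MBs often stay predicted throughout the transition.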

Similarly to our approach, at the level of preprocessing for operation on minimally-decoded video data, DC coefficients (reduced DC images) are utilized in [12]–[15] for scene change detection. In [13], DC coefficients of pictures that are at some temporal distance from each other are used to construct a comparison metric based on the sum of absolute differences. In [15], DC histograms are used to detect scene changes. Detection of dissolves is considered using averages of past histograms. In [12], the differences in the luminance and chrominance DC histograms between I pictures are utilized for change detection, along with MB motion type information for the predicted (P, B) pictures. Also, the specific shape of the intensity variance of DC coefficients in I and P pictures is detected as an indication of a particular type of gradual transition (dissolve). As with corresponding histogram-based algorithms working in the pixel domain, detection algorithms that use DC histogram averages or accumulated differences in a time window have a number of disadvantages. These are related to the fact that gradual changes have widely-variable length, which makes it difficult to set the size of the time window a priori, and the algorithms may use multiple thresholds that must be adapted for each video shot.

In [17], a comparison of metrics used for scene change detection is performed. The metrics are applied to two successive images and their response to a scene change is recorded. These metrics include the chi-square test of histograms, absolute values of histogram differences, likelihood ratio, Snedecor's F-test, template matching, inner product, and three newly-introduced metrics. The metrics were tested on 416 frames. There were 340 pairs with no changes and 75 pairs with changes. The F-test and the three new metrics performed the best overall for abrupt cuts, but required more computation time. Using a test sample of six fades and seven dissolves, it was found that the statistics-based metrics performed well for this case. In [19], which is based on the authors' earlier work [18], the chi-square test using intensity histograms of two successive I-frames of an MPEG video is utilized for scene change detection. Global, row, and column histograms are used in the decision process. Since a gradual transition spans several frames and can cause a detection decision spanning two or more I-frames, the system waits for the next I-frame (latency) before outputting the decision for the current I-frame. The assumption is that the distance between the I-frames is 12. The ground truth for the scene changes included the motion-induced transitions.
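A minimal sketch of a chi-square comparison of two intensity histograms, in the spirit of [17], [19]; the symmetric form and the `eps` guard are assumptions, since the cited works may use a different normalization:

```python
def chi_square(h1, h2, eps=1e-12):
    """Symmetric chi-square distance between two histograms over the same bins.

    Each term is (a - b)^2 / (a + b); eps guards against empty bins.
    Large values indicate dissimilar frames (a candidate scene change).
    """
    return sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))
```

Identical histograms score zero; disjoint histograms score high, so a single threshold on this distance can flag cuts between successive I-frames.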

In [14], DC coefficients representing each image are mapped to a lower-dimensional space using the FastMap transform. In our approach, the lower-dimensional representation of the DC coefficients for each picture is based on the Karhunen–Loeve transformation. After the lower-dimensional representation of the original data, we use stochastic modeling and analysis for change detection. In [14], in the lower-dimensional space, the video sequence is represented by a trail of points (called a VideoTrail). By clustering and analyzing the geometrical distribution of points in the space, scene changes are detected as sparse trails. The performance and complexity of this method are dependent on the clustering and geometrical analysis algorithms used in a multidimensional space. In [20], a comparison between the authors' VideoTrails (VT) approach and various existing algorithms for scene change detection is presented. There are three versions of the VT algorithm using reduced ten-dimensional YUV input data. These are compared to the Plateau [13], VarCurve [12], and Twin Comparison [2] algorithms. Natural video simulations were performed, with special effects that range from short duration (five frames) to more than 100 frames in length. With global motion or camera motion alarm removal, the VT-based algorithms performed the best. The best overall VT-based algorithm produced a Recall of 62.3% and a Precision of 69.7%. Another VT version yielded a higher Recall of 68% with a lower Precision of 57.1%. The Recall and Precision for the Twin Comparison algorithm were , .

In addition to differences further discussed later, in our simulations we do not include the gradual motion-related alarms in the ground truth, unless they are so abrupt that a histogram-based detector also detects them. Instead, they are counted as false alarms to account for the worst-case scenario when a global motion detector is not available. The algorithms introduced in this paper do not utilize a global motion detector, which would be capable of marking these motion-induced scene changes, depending on the application domain.

Our approach departs from the concept of fixed-size sample change detection, and from successive deterministic or statistical comparisons between pairs of frames to detect scene changes. In this paper, we propose an approach for scene change detection that is based on the more powerful concept of open statistical tests, which assumes a sequential decision process based on, theoretically, the entire past. The sequential aspect of the approach is critical for a real-time, one-pass solution where we do not have the entire sample (sequence) at our disposal; rather, we are receiving one sample at a time. In conjunction with the previous considerations, the model used for the video sequences is that of a stochastic process where scene changes are reflected by changes in the characteristics (parameters) of the process. This allows for flexibility in detecting various types of changes without the constraints of imposing a specific model of the change (e.g., linear). No predefined comparison distance is set, such as, for example, a fixed distance for comparing two frames. No limited time window, extending into the past and/or future (no latency), is set and used in order to make a decision.


Also, with the exception of very abrupt changes (cuts), considering only a few pictures for scene change detection analysis is not reliable. For gradual scene changes, where the video information varies slowly from picture to picture, the statistical information about the change is allowed to accumulate and trigger the detection. This emphasizes a unified approach to detecting different types of changes having variable length.

III. DIMENSIONALITY REDUCTION AND FEATURE VECTOR REPRESENTATION

The preprocessing phase prior to applying the change detection consists of dimensionality reduction and a lower-dimensional representation of the original video data. The first step to reduce dimensionality is to consider only the DC coefficients of each block in the image. For efficiency and practical purposes, only I- and P-pictures are utilized by the algorithm. Luminance and chrominance (YUV) DC coefficients are considered for each macroblock in the image. The DC coefficients are readily available in I-pictures, which are intracoded. A widely-used method for estimating the DC coefficients in P-pictures is presented in [21]. Thus, a matrix of DC coefficients (the DC image) is obtained for each picture. For the luminance and chrominance components respectively (YUV), a vector of DC coefficients is formed by lexicographic ordering, and the three resulting vectors are merged into a DC vector corresponding to each picture.
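The assembly of a per-picture DC vector from the three DC images can be sketched as below; the array shapes in the usage example are purely illustrative:

```python
import numpy as np

def dc_vector(dc_y, dc_u, dc_v):
    """Concatenate lexicographically-ordered YUV DC images into one DC vector.

    Each argument is a 2-D DC image; ravel() gives the row-major
    (lexicographic) ordering, and the three vectors are merged.
    """
    return np.concatenate([dc_y.ravel(), dc_u.ravel(), dc_v.ravel()])
```

One such vector is produced per I- or P-picture and becomes a column of the training data matrix described below.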

In general, and more so if a real-time constraint is imposed on the process of scene change detection, some degree of dimensionality reduction of the original DC image data can be performed. Dimension reduction similar to ours has also recently been used by other approaches in the literature for the purpose of scene change detection (see, for example, [14], where the FastMap transform was used to reduce the dimensionality of the original DC image space). In this paper, the transformation chosen for this purpose is principal component analysis (PCA). We model the video sequence, with each picture represented by its DC vector, as a stochastic process. The Karhunen–Loeve transformation (KLT) gives the optimal representation of a stochastic process in a finite-dimensional vector space, in the MSE sense. PCA performs a partial KLT by identifying the largest eigenvalues and retaining only the corresponding eigenvectors. The PCA transform retains the most important features of the images. We should note that the scene change detection approach presented in this paper is not constrained to work with a specific type of representation of the video data. The input can be represented by the sequence of original DC images (vectors), or by vectors corresponding to the histogram bins of the images.

We consider a training set of DC data vectors corresponding to images in the beginning of each video shot for the purpose of subspace determination. Let X be an N x M data matrix whose columns are the DC vectors from the training set; N is the dimension of the DC-coefficient data vectors and M is the size of the training set (number of vectors considered). The PCA finds eigenvectors of the estimated correlation matrix R = (1/M) X X^T, which is an N x N matrix. The size of the matrix R is generally very large, which would result in computationally intensive operations. As described in [22], an efficient approach for negotiating this problem is to consider the implicit matrix R' = (1/M) X^T X. The matrix R' is of size M x M, which is much smaller than the size of R. The p largest eigenvalues λ_i and corresponding eigenvectors φ_i of R can be found from the largest eigenvalues and eigenvectors of R' as follows [22]:

λ_i = λ'_i,   φ_i = X φ'_i    (1)

where λ'_i, φ'_i are the corresponding eigenvalues and eigenvectors of R' (the vectors φ_i are subsequently normalized to unit length).

The eigenmatrix Φ = [φ_1, …, φ_p] is formed, having as columns the p principal eigenvectors. Using the matrix Φ, each original DC vector sample x_k in the current video shot can be represented using a reduced-dimensionality feature vector y_k = Φ^T x_k, where k is the time index. The feature vectors represent the input to the change detection algorithm.
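The subspace computation via the small M x M implicit matrix, in the spirit of [22] and equation (1), together with the feature-vector projection, can be sketched as follows. Dimensions, the mean-free assumption on the data, and the explicit normalization step are assumptions of the sketch:

```python
import numpy as np

def pca_subspace(X, p):
    """Top-p eigenvectors of R = (1/M) X X^T via the small matrix X^T X / M.

    X: N x M data matrix (columns are training DC vectors), p <= M.
    Eigenvectors of the M x M matrix are lifted by X to the N-dim space:
    if (X^T X / M) v = lam v, then (X X^T / M)(X v) = lam (X v).
    Returns Phi (N x p, unit-norm columns) and the top eigenvalues.
    """
    N, M = X.shape
    small = X.T @ X / M                      # M x M implicit matrix
    vals, vecs = np.linalg.eigh(small)       # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:p]       # pick the p largest
    lam, V = vals[order], vecs[:, order]
    Phi = X @ V                              # lift eigenvectors to N dimensions
    Phi /= np.linalg.norm(Phi, axis=0)       # normalize columns to unit length
    return Phi, lam

def feature_vector(Phi, x):
    """Reduced-dimensionality representation y = Phi^T x of a DC vector x."""
    return Phi.T @ x
```

Only an M x M eigendecomposition is needed, which is the point of the implicit-matrix trick when N (the DC vector dimension) is much larger than M (the training set size).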

IV. SCENE CHANGE DETECTION

An extensive literature exists related to the area of statistical sequential analysis, and in particular change detection theory (see [23] and references therein). The video scene change detection approach presented in this paper is based on change detection theory. The dimensionality reduction of the DC data vectors corresponding to each I- or P-picture is performed as described in Section III. Thus, a sequence of corresponding feature vectors is obtained. The change detection algorithm operates sequentially on the feature vectors. The problem of scene change detection becomes one of detecting changes in the parameters of the probability density function (pdf) associated with this vector random process.

A. Additive Modeling

Let us first discuss the case where a scene change is modeled as an additive change in the mean parameter θ of the pdf associated with the feature vector sequence that describes the video data. The reduced-dimensionality feature vectors y_k, corresponding to I- and P-pictures, are assumed to form an i.i.d. sequence of p-dimensional random vectors having Gaussian distribution N(θ, Σ), with pdf:

f_θ(y) = (2π)^(−p/2) |Σ|^(−1/2) exp[ −(1/2) (y − θ)^T Σ^(−1) (y − θ) ]    (2)

The scene change is modeled as a change in the vector parameter θ of the pdf characterizing the feature vector random process. The parameter θ = θ_0 before the change, and θ = θ_1 after the change. Depending on the amount of information available about the parameter before and after the change, the problem formulation is correspondingly different. Given the requirement of operating in real-time, in general minimal or no information is available about the parameter θ_1 after

110 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 5, NO. 1, MARCH 2003

the change. Let us begin by formulating the problem of testing between the following two composite hypotheses:

H_0: ||θ − θ_0||_Σ ≤ a,   k = 1, …, n
H_1: ||θ − θ_0||_Σ ≥ b,   k = t_0, …, n    (3)

where ||z||_Σ = (z^T Σ^(−1) z)^(1/2), t_0 is the true change time, and 0 ≤ a < b. Therefore, the formulation of the hypothesis testing problem reflects the case where there is a known upper bound a for the deviation of θ from θ_0 under H_0, and a known lower bound b for this deviation under H_1. The case of interest, where θ_0 is assumed to be known and θ_1 is assumed completely unknown, can be obtained as a limit case of the solution to the problem above, as we shall see below.

The generalized likelihood ratio (GLR) [24] algorithm thatgives the solution to the hypothesis testing problem in (3), uses ageneralized likelihood ratio, where the unknown parameters arereplaced by their maximum likelihood (ML) estimates. Thus,for the problem formulation in (3), the generalized likelihoodratio for a sequence of vectors is

    Λ_k = max_{1≤j≤k} [ sup_{||θ1-θ0||≥b} prod_{i=j}^{k} f_{θ1}(y_i) ] / [ sup_{||θ-θ0||≤a} prod_{i=j}^{k} f_θ(y_i) ]   (4)

where f_θ is the corresponding parameterized probability density function. The sequential GLR algorithm is then

    t_a = min{ k : g_k ≥ h },    g_k = ln Λ_k   (5)

where k is the discrete time index, t_a is the alarm (detection) time, g_k is the test statistic, and h is a threshold. Thus, the test statistic is based upon the likelihood ratio, which can easily be proven to be a sufficient statistic.
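To make the stopping rule concrete, here is a minimal Python sketch that assumes the Gaussian mean-change statistic derived later in this section; the function name, the window length, and the synthetic data are illustrative and not from the paper:

```python
import numpy as np

def glr_alarm(ys, theta0, cov, h, window=50):
    """Sequential GLR detector: at each time k, maximize the log-likelihood
    ratio over candidate change times j, and stop when it crosses h."""
    cov_inv = np.linalg.inv(cov)
    for k in range(len(ys)):
        g = 0.0
        for j in range(max(0, k - window + 1), k + 1):
            seg = ys[j:k + 1]                    # hypothesized post-change segment
            dev = seg.mean(axis=0) - theta0      # ML estimate of the mean shift
            g = max(g, 0.5 * len(seg) * dev @ cov_inv @ dev)
        if g >= h:
            return k, g                          # alarm time t_a and statistic
    return None, g
```

On i.i.d. Gaussian data with a sudden mean shift, the statistic stays low before the change and grows roughly linearly in the number of post-change samples, so both cuts (large shifts) and gradual edits (accumulating shifts) cross the same threshold.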

Under the assumption of an i.i.d. sequence, g_k becomes

    g_k = max_{1≤j≤k} sup_{θ1} S_j^k,    S_j^k = sum_{i=j}^{k} ln [ f_{θ1}(y_i) / f_{θ0}(y_i) ]   (6)

Given the i.i.d. Gaussian assumption, S_j^k can be written as [see (2)]

    S_j^k = sum_{i=j}^{k} (θ1 - θ0)^T Σ^{-1} ( y_i - (θ0 + θ1)/2 )   (7)

As presented in [24], the following expression for g_k is obtained:

(8)

where

    ȳ_j^k = (1/(k-j+1)) sum_{i=j}^{k} y_i    and    χ²(j,k) = (k-j+1) (ȳ_j^k - θ0)^T Σ^{-1} (ȳ_j^k - θ0)   (9)

Thus, for the current problem formulation, at each time index k, the GLR algorithm in (5) uses the decision function g_k as given by (8). The data that are needed in (9) are the feature vectors y_i, the covariance Σ, and θ0, all variables that can be determined as detailed later. The expression for g_k admits an iterative computation, allowing for fast operation of the detection algorithm. Only updates are computed at each time k, with the arrival of a new data vector y_k. As a practical implication, the feature vectors corresponding to the pictures do not have to be memorized, and no permanent processing buffer is needed.
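One possible form of this iterative computation (a hypothetical sketch, not the authors' implementation) keeps only a bounded list of cumulative sums of y - θ0, so each new vector costs work proportional to the window length and no frame buffer is needed:

```python
import numpy as np

class RecursiveGLR:
    """Bounded-memory GLR for a mean change: only cumulative sums of
    (y - theta0) are stored, never the feature vectors themselves."""
    def __init__(self, theta0, cov, h, window=50):
        self.theta0 = np.asarray(theta0, dtype=float)
        self.cov_inv = np.linalg.inv(cov)
        self.h = h
        self.window = window
        self.cums = [np.zeros_like(self.theta0)]   # running sums of y - theta0

    def update(self, y):
        """Absorb one feature vector; return (alarm?, statistic)."""
        self.cums.append(self.cums[-1] + (np.asarray(y, dtype=float) - self.theta0))
        if len(self.cums) > self.window + 1:
            self.cums.pop(0)                       # forget data beyond the window
        g = 0.0
        for m in range(1, len(self.cums)):         # candidate change m steps back
            s = self.cums[-1] - self.cums[-1 - m]  # sum of the last m deviations
            g = max(g, 0.5 * (s @ self.cov_inv @ s) / m)
        return g >= self.h, g
```

The identity 0.5 * m * d^T Σ^{-1} d = 0.5 * s^T Σ^{-1} s / m, with s the sum of the last m deviations and d = s/m their mean, is what lets the statistic be computed from sums alone.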

For the case of interest, the parameter before the change, θ0, is assumed to be known, and the parameter after the change is assumed completely unknown but different from θ0. This is a special case of the hypothesis testing problem in (3), which can be written as follows:

    H0: θ = θ0    against    H1: θ = θ1, θ1 ≠ θ0, for k ≥ t0   (10)

Thus, we take the limit case of vanishing bounds in (3) and (8). Therefore, the GLR algorithm in (5) becomes

    t_a = min{ k : g_k ≥ h },    g_k = max_{1≤j≤k} ((k-j+1)/2) (ȳ_j^k - θ0)^T Σ^{-1} (ȳ_j^k - θ0)   (11)

with ȳ_j^k as in (9). In the development above, θ0 is assumed to be known. In fact, θ0 can be estimated using a number of feature vectors in the beginning of each video shot. The covariance Σ is estimated using the same set of vectors. Also, theoretically, at each k, the statistic is constructed based on all data from the entire past [see (11)]. In practice, we can readily consider a sufficiently large maximum number M of past data points, so that the maximization begins at k - M + 1 and ends at k.

Let us now summarize the overall strategy for detecting additive changes in the mean parameter of the pdf describing the video sequence, for the case of multiple scene changes. An initial set of original DC vectors in each shot is used to determine the subspace representation using the PCA (see Section III). Thus, each original DC data vector corresponding to I- and P-pictures in the video sequence can now be transformed into a reduced-dimensionality feature vector y_k. The same initial vectors in each video shot are used to estimate the covariance Σ and the mean parameter θ0. Sequentially, the feature vectors are the input to the change detection algorithm. The change detection algorithm is activated at the first I- or P-picture following the initialization set in a video shot, and it continues operating sequentially until a scene change is detected. The process described above is repeated for each new video shot and is depicted in Fig. 1.


Fig. 1. Block diagram for scene change detection.

The shaded blocks indicate the points where the currently discussed additive-model scene change detector differs computationally from the nonadditive detector which will be presented in Section IV-B. A pseudocode that summarizes the main steps of the scene change detection algorithm is also provided as follows.

01: while (!EOF) {            /* end of video sequence not reached */
02:   bSceneChange = FALSE;
03:   k = 0;                  /* reset for a new video shot */
04:   while (k < N0) {        /* initialization phase; N0 pictures */
05:     get x_k;              /* DC vector of next I- or P-picture */
06:     temp store x_k;
07:     if (k == N0 - 1) {
08:       compute PCA and the feature subspace;
09:       estimate theta_0 and Sigma from y_0, ..., y_{N0-1};
10:     }
11:     /* in all cases */
12:     k = k + 1;
13:   }
14:   while (!bSceneChange) { /* starting with picture N0 in the current shot, change detection */
15:     get x_k; compute y_k;
16:     k = k + 1;
17:     compute g_k;          /* (11) */
18:     if (g_k < h)
19:       continue;
20:     else {                /* scene change detected */
21:       bSceneChange = TRUE;
22:       alarm();            /* result: go to statement 01, new video shot */
23:     }
24:   }
25: }
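A compact Python rendering of the pseudocode above, under simplifying assumptions: the DC vectors are synthetic, all constants (initialization length, eigenvalue fraction, threshold, window) are illustrative rather than the paper's values, and the test statistic is the additive-model GLR for a mean change:

```python
import numpy as np

def detect_scene_changes(dc_vectors, n_init=30, h=100.0, window=40, ev_frac=0.05):
    """Per-shot loop: PCA initialization on the first n_init pictures of a
    shot, then sequential GLR mean-change detection until the next alarm."""
    dc_vectors = np.asarray(dc_vectors, dtype=float)
    alarms, k, n = [], 0, len(dc_vectors)
    while k + n_init < n:
        # --- initialization phase ---
        block = dc_vectors[k:k + n_init]
        mean = block.mean(axis=0)
        w, v = np.linalg.eigh(np.cov(block, rowvar=False))
        proj = v[:, w >= ev_frac * w.max()]          # keep dominant eigenvectors
        feats = (block - mean) @ proj
        theta0 = feats.mean(axis=0)
        dim = proj.shape[1]
        cov = np.atleast_2d(np.cov(feats, rowvar=False)) + 1e-6 * np.eye(dim)
        cov_inv = np.linalg.inv(cov)
        # --- detection phase ---
        k += n_init
        cums, fired = [np.zeros(dim)], False
        while k < n and not fired:
            y = (dc_vectors[k] - mean) @ proj        # project new picture
            cums.append(cums[-1] + (y - theta0))
            if len(cums) > window + 1:
                cums.pop(0)                          # bounded past window
            g = max(0.5 * (s @ cov_inv @ s) / m
                    for m in range(1, len(cums))
                    for s in [cums[-1] - cums[-1 - m]])
            k += 1
            if g >= h:                               # scene change detected
                alarms.append(k - 1)
                fired = True
        if not fired:
            break
    return alarms
```

After each alarm the loop reinitializes on the new shot, mirroring the "go to statement 01" behavior of the pseudocode.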

B. Non-Additive Modeling

While maintaining the general model of a video sequence as a vector random process, in this section we investigate the effect of different assumptions about the model and about the nature of the changes in the video sequence. As before, we shall assume that the vector random process is an i.i.d. sequence that can be described with a model that uses a parameterized probability density function f_θ. The objective is to detect changes in the vector parameter θ. For nonadditive changes, the likelihood ratio may become computationally complex. One way to address this problem is to consider using the asymptotic local hypotheses approach for designing the change detection algorithm (as in [24]). A brief required background follows.

Let us consider a parametric family of distributions {P_θ} with θ ∈ R^d, and a sample y_1, ..., y_n of size n [24]. Assume there exists a convergent sequence of points in R^d, denoted by θ_n, such that

    θ_n = θ0 + (ν/√n) u

where, without loss of generality, u is the vector of change direction with ||u|| = 1. Then the two distributions P_{θ0} and P_{θ_n}, corresponding to the two hypotheses

    H0: θ = θ0    against    H1: θ = θ_n   (12)


get closer to each other when n → ∞. Let us write the log-likelihood ratio for the sample as:

    S_1^n = sum_{i=1}^{n} ln [ f_{θ_n}(y_i) / f_{θ0}(y_i) ]   (13)

Definition [24]: The parametric family of distributions {P_θ} is called locally asymptotically normal (LAN) if the log-likelihood ratio for hypotheses H0 and H1 can be written as

    S_1^n = ν u^T Δ_n - (ν²/2) u^T I(θ0) u + r_n   (14)

where the vector efficient score is

    Δ_n = (1/√n) sum_{i=1}^{n} z(y_i),    z(y) = ∂ ln f_θ(y)/∂θ |_{θ=θ0}   (15)

I(θ0) is the Fisher information matrix (covariance of z), and the following asymptotic normality holds:

    Δ_n → N(0, I(θ0)) under P_{θ0}, as n → ∞   (16)

The random variable r_n is such that r_n → 0 almost surely under the probability measure P_{θ0}. A corollary of the LAN properties, for parametric families satisfying regularity conditions, is that the vector efficient score is an asymptotically sufficient statistic.
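As a concrete instance (our illustration; the paper does not spell this case out here), for the Gaussian family with mean parameter θ and known covariance Σ, the efficient score and Fisher information are:

```latex
z(y) \;=\; \left.\frac{\partial \ln f_\theta(y)}{\partial \theta}\right|_{\theta=\theta_0}
      \;=\; \Sigma^{-1}(y-\theta_0),
\qquad
I(\theta_0) \;=\; \mathrm{E}\!\left[z(y)\,z(y)^{T}\right] \;=\; \Sigma^{-1}.
```

In this special case the score-based statistic reduces to the additive mean-change statistic of Section IV-A, which is why the two detectors share the same block diagram.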

Considering the sequential context and the parameterized pdf describing the data, a nonadditive change in the vector parameter θ is to be detected, under the assumption that the parameter before the change, θ0, is known (estimated), and the parameter after the change is unknown and different from θ0. In this case, the two hypotheses are formulated in the context of the local asymptotic assumption, for a small change from the reference parameter θ0. Thus, the two hypotheses are [24]

    H0: θ_k = θ0, for k ≥ 1    against    H1: θ_k = θ0 for k < t0 and θ_k = θ0 + δθ for k ≥ t0   (17)

where t0 is the change time for the parameter, k is the temporal index, and ||δθ|| is small.

The generalized likelihood ratio algorithm is the relevant method for solving this hypothesis testing problem. Thus, for the sequential case, the GLR algorithm can be written as

    t_a = min{ k : g_k ≥ h },    g_k = max_{1≤j≤k} sup_{δθ} S_j^k   (18)

The log-likelihood ratio for the sequence of vectors y_j, ..., y_k is

    S_j^k = sum_{i=j}^{k} ln [ f_{θ0+δθ}(y_i) / f_{θ0}(y_i) ]   (19)

In the context of the local asymptotic approach, similarly to (14), the second-order Taylor expansion of the log-likelihood ratio around θ0, for the sample y_j, ..., y_k, is [24]:

    S_j^k ≈ δθ^T sum_{i=j}^{k} z_i - ((k-j+1)/2) δθ^T I(θ0) δθ   (20)

where I(θ0) is the Fisher information matrix and z_i is the efficient score for sample y_i, given by

    z_i = ∂ ln f_θ(y_i)/∂θ |_{θ=θ0}   (21)

Using the second-order expansion (20) and the i.i.d. assumption for the sample of size k - j + 1, we have [24]

    sup_{δθ} S_j^k = (1/2) χ²(j,k)   (22)

where

    χ²(j,k) = (1/(k-j+1)) (Z_j^k)^T I(θ0)^{-1} Z_j^k   (23)

and

    Z_j^k = sum_{i=j}^{k} z_i   (24)

Therefore, the GLR algorithm in (18), which provides the solution to the change detection problem in (17), can now be written as follows:

    t_a = min{ k : g_k ≥ h },    g_k = max_{1≤j≤k} (1/2) χ²(j,k)   (25)

with χ²(j,k) as in (23). Under the Gaussian assumption for the vectors y_k:

    f(y) = (2π)^(-d/2) (det Σ)^(-1/2) exp( -(1/2) (y - μ)^T Σ^{-1} (y - μ) )   (26)

In this case, let us consider the parameter monitored for change to be the covariance Σ. The parameter, which is now a matrix, can be treated more conveniently as a vector, obtained by lexicographically ordering the columns of Σ, i.e., stacking the columns on top of each other. Equation (25) is used to compute the test statistic at time k and to detect changes in the parameter. As in Section IV-A, let us review the data that are needed by the GLR algorithm in (25). The parameter (covariance Σ) before the change must be estimated. The Fisher information matrix I is computed as the covariance of the vector efficient scores, estimated at θ0. The efficient scores are computed using (24) and (26).
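A sketch of these computations (hypothetical code, not from the paper): the score of the Gaussian log-density with respect to the covariance is vectorized column-wise, and a pseudo-inverse stands in for the inverse Fisher matrix, because the duplicated off-diagonal entries of a vectorized symmetric matrix make the score covariance singular:

```python
import numpy as np

def cov_score(y, mu0, sigma0_inv):
    """Efficient score of the Gaussian log-density w.r.t. the covariance,
    evaluated at (mu0, Sigma0) and lexicographically vectorized."""
    dev = np.asarray(y, dtype=float) - mu0
    m = 0.5 * (sigma0_inv @ np.outer(dev, dev) @ sigma0_inv - sigma0_inv)
    return m.flatten(order="F")        # stack columns on top of each other

def namcd_statistic(scores, fisher_pinv, window=50):
    """Chi-square-type GLR: max over candidate change times of
    Z^T I^+ Z / (2m), with Z the sum of the last m scores."""
    scores = np.asarray(scores)
    g, k = 0.0, len(scores)
    for m in range(1, min(window, k) + 1):
        z = scores[k - m:].sum(axis=0)
        g = max(g, (z @ fisher_pinv @ z) / (2 * m))
    return g
```

The Fisher matrix itself would be estimated as the sample covariance of the scores over the initialization pictures, as described above.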

Let us now consider the detection of multiple scene changes in the video sequence. Similarly to the algorithm presented in Section IV-A, we will use a number of pictures in the beginning of a video shot in order to estimate the quantities assumed known by the algorithm. The first I- or P-pictures of a video shot are used for estimating the lower-dimensional subspace that enables the transformation of the original DC vectors into feature vectors y_k. As shown earlier, the parameter that is monitored for change detection consists of the covariance matrix Σ corresponding to the feature vectors y_k. The parameter before the change can be estimated using the same set of feature vectors in the beginning of a video shot that was used for the subspace determination. This allows the computation of the vectors of efficient scores z_i and their covariance I(θ0) needed by the change detection algorithm in (25). The same block diagram and pseudocode presented in Section IV-A are relevant in this case; however, the test statistic g_k is computed differently in the nonadditive context (25).


Fig. 2. Snapshots of special effect scene changes.

V. SIMULATION RESULTS

The video sequences utilized in the simulations were encoded using the MPEG-2 video compression standard. The size of the video image was 240 × 352 pixels. The chrominance format of the videos was 4:2:0. The encoding pattern in terms of picture types had two B-pictures between P-pictures.

Along with our algorithms for scene change detection, an algorithm based on the idea of color histogram differences


between pictures (see Section II), and operating in the compressed domain, was also implemented. The equations (27) used for computing the histogram-based measures were taken from [12]; however, they were used differently by applying them sequentially only to I- and P-pictures (across larger time intervals):

    D(k) = sum_{c ∈ {Y,U,V}} sum_{j=1}^{N_b} | H_k^c(j) - H_{k-1}^c(j) | / S,    R(k) = D(k) / D(k-1)   (27)

The index c in the first equation represents one of the color components (Y, U, V), H_k^c(j) represents histogram bin j for color component c in frame k, N_b is the number of bins, S is the size of a bin, and k indexes the I- and P-pictures in the video sequence. Thus, at each I- or P-picture, absolute differences between I- or P-picture histogram bins are summed and normalized by the histogram bin size, for the Y, U, V components. This normalized sum, D(k), is divided by the normalized sum obtained for the previous I- or P-picture to get a ratio value. This ratio is sequentially thresholded, and if it exceeds a specified limit, a scene change is declared.
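The histogram test described above can be sketched as follows (our reconstruction with illustrative names; the exact form of (27) is in [12]):

```python
import numpy as np

def normalized_hist_diff(prev_hists, cur_hists, bin_size):
    """Sum of absolute histogram-bin differences over the Y, U, V
    components, normalized by the histogram bin size."""
    return sum(float(np.abs(np.asarray(c) - np.asarray(p)).sum())
               for p, c in zip(prev_hists, cur_hists)) / bin_size

def hcd_step(prev_diff, cur_diff, threshold):
    """Declare a scene change when the ratio of consecutive normalized
    sums exceeds the threshold."""
    ratio = cur_diff / prev_diff if prev_diff > 0 else 0.0
    return ratio > threshold, ratio
```

Because only two consecutive I- or P-pictures enter each decision, a slow dissolve changes every pairwise difference only slightly, which is exactly why this detector struggles with gradual edits.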

The following abbreviations will be used for the three scene change detection algorithms: HCD for the histogram-based change detector described above, AMCD for the additive model change detector, and NAMCD for the nonadditive model change detector. The algorithms presented in this paper were tested on video sequences from news (rapid succession of visual stories on various topics, no news anchor), music videos, sports, travel, and documentaries (a large variety of topics in a fast-paced manner of presentation). These are the types of videos where the use of gradual scene changes (special effects) is widespread, compared to movie sequences where abrupt changes are predominant. The videos contain a large variety of changes, and a variable pattern of occurrence of these changes. Scene changes occur in fast succession in the majority of portions of the videos. Also, abrupt changes are intermixed with gradual scene changes. There is a significant degree of camera motion throughout the sequences. The video data was chosen to provide a large concentration of gradual scene changes in natural video. Thus, of the total number of scene changes, 45% are gradual. These gradual scene changes include a large variety of types. Special edits detected, samples of which are shown in Fig. 2, include vertical or horizontal band slides, coarse and additive dissolves, white-outs, fades, merging images, diagonal "wipe-out" bars, "swing-in" images, "flip page" effects, warped images, and sliding or rotating images. The common feature of these changes is that they occur gradually and thus cannot be reliably detected by examining only a few images at a time, both in terms of image data and motion vector information. Also, the lengths of the gradual changes vary widely, from four to 120 frames; however, most are concentrated around 30-50 frames. The diversity of change types reflects the real case where scene changes do not obey a specific model (e.g., a linear variation in the intensity). All in all, the videos represent a difficult test set for scene change detection (see also the sample execution traces provided).

Let us present the relevant parameters that were used in the implementation of the three algorithms. For the HCD, a fixed number of bins (for a range 0-255) was used for each of the luminance and chrominance components, together with a fixed detection threshold. In the case of the AMCD and NAMCD algorithms, we have the following settings. A fixed number of I- or P-pictures after each detected scene change is used for dimensionality reduction using the PCA. In general, eigenvalues (and corresponding eigenvectors) that are smaller than 1-10% of the largest eigenvalue can be discarded (thus further reducing the number of components in the feature vector in some cases). After the detection of a change, a number of pictures can be discarded in order to allow stabilization of the new video shot, before acquiring pictures for the new subspace determination; a number of I- and P-pictures was used for this purpose. The maximum number of past data points (and, implicitly, past I- and P-images) considered for the test statistic computation at time k (see Section IV) was the same for both algorithms. The values of the thresholds for the AMCD and NAMCD algorithms were chosen based on the assessment of the performance of the algorithms on different video sequences, and to achieve a balance between false and missed alarms.

Sample execution traces of the histogram-based change detection algorithm (HCD) and the traces of the test statistic g_k for our algorithms (AMCD and NAMCD) are shown in Fig. 3. The I- and P-pictures are indexed by k on the horizontal axes. The corresponding thresholds are marked in each instance by a horizontal line. The traces for the three algorithms (HCD, AMCD, NAMCD) are presented together for each of the videos. In terms of marking alarms, all the peaks of the execution traces that exceed the thresholds and are not marked otherwise represent correct scene change detections, as ascertained by visual examination of the video sequence. False alarms are marked with a dedicated symbol in the graphs, and special types of false alarms (motion-related) are marked by a separate symbol. Missed alarms are indicated by vertical lines placed at the corresponding time index positions. The results of the scene change detection are summarized in Table I.1 The numbers in parentheses represent the performance of the algorithms if a global motion detector were to invalidate alarms that are clearly created by simple camera motion (pans, zooms). HCD is not sensitive to camera motion unless it is very abrupt, as further discussed.

The HCD algorithm is able to detect abrupt changes (cuts) and some special effects that span a small number of pictures. The HCD-detected special effects feature a sufficiently abrupt transition in the grayscale/color level from one video shot to the other. The threshold for the HCD algorithm was chosen to allow it to detect the abrupt changes (with various response magnitudes) and as many as possible of the special effects, without increasing the false alarm rate unacceptably. Decreasing the value of the threshold further would drastically increase the number of false alarms. This can be easily ascertained by observing

1Recall and Precision are related to two other measures used in detection theory: Detection Rate = Recall, and False Alarm Rate = 1 - Precision.
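The relations in the footnote can be written out directly (an illustrative helper, not part of the paper's implementation):

```python
def recall_precision(true_pos, missed, false_pos):
    """Recall = Detection Rate; 1 - Precision = False Alarm Rate."""
    recall = true_pos / (true_pos + missed)
    precision = true_pos / (true_pos + false_pos)
    return recall, precision
```

For example, 18 correct detections with 2 missed and 3 false alarms give a recall (detection rate) of 0.9 and a false alarm rate of 3/21.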


Fig. 3. Sample execution traces for scene change detection algorithms (HCD, AMCD, NAMCD).


the "noisy" low-value portion of the HCD trace for all videos. Increasing the value of the threshold would decrease the number of gradual changes detected even further, while still not eliminating many false alarms that are represented by strong responses [see, for example, the marked portion in Fig. 3(d)]. The false alarms generated by the HCD algorithm and the ones generated by the AMCD and NAMCD algorithms are caused by different factors. The HCD is sensitive to rapid transitional changes in the image, for example changes in the color structure of the background, or flash photography. An illustrative example can be seen in Fig. 3(d), where many high-magnitude false alarms are generated. Once they disappear, these transitional effects leave the scene unchanged, and therefore they should not be registered as valid alarms. In the same region [Fig. 3(e) and (f)], the AMCD and NAMCD algorithms exhibit a form of statistical smoothing, by not reacting to changes that do not permanently affect the video shot, which is a desirable behavior. The HCD algorithm is not sensitive to gradual camera motion and/or objects gradually entering/leaving the scene. The AMCD and NAMCD algorithms react to camera motion if it causes a sufficiently large change in the image. The alarms that fall into this category are marked with the corresponding symbol on the traces in Fig. 3. Depending on which course of action is taken in qualifying this special type of alarm, a global motion detection module can eliminate most of these alarms, or they can be counted as valid alarms (resulting in finer-granularity segmentation). To account for a worst-case scenario, these alarms were counted as false alarms in assessing the performance of our algorithms. Thus, the results of testing the AMCD and NAMCD algorithms represent a lower bound on their performance, and they could greatly benefit from a global motion detector. The missed alarms for the AMCD and NAMCD algorithms can be generated in some cases by an insufficient magnitude of change in the monitored parameter, or by a scene change that occurs soon after another detection, while the algorithms are in the reinitialization phase.

In the simulations conducted, a fixed number of initialization I- or P-pictures was used. Considering two B-pictures in between I- or P-pictures, the time interval in the beginning of each video shot used for algorithm initialization (training) is approximately 1 s at 30 fps. If another scene change occurs early during this time interval after the current detection, i.e., the current video shot is less than 1 s long, there will likely be a missed detection. An example of this type of situation is shown in Fig. 3(f). This pattern of occurrence of scene changes is more characteristic of video commercials. We would expect to have very few situations where video shots are shorter than 1 s in natural video (commercials obviously may be an exception). If one expects to use such sequences, the algorithms can be modified to use the B-pictures in the process, i.e., by taking the required number of pictures in the beginning of the video shot regardless of type (I, P, B).

The algorithms presented in this paper operate sequentially and in one pass. For the AMCD and NAMCD algorithms, depending on the nature and severity of the change for a special edit, the change can be detected with zero delay, or with a delay

TABLE I
CHANGE DETECTION PERFORMANCE FOR HCD, AMCD, NAMCD

of a number of pictures. This feature is a direct consequence of the specific sequential statistical context for change detection, and it reflects the unified approach for detecting both abrupt and gradual scene changes. The statistical information about the change is allowed to accumulate and trigger a detection for scene changes that effect gradual modifications in the image.

The differences between the execution times of the HCD, AMCD, and NAMCD algorithms and the execution time of the MPEG decoder alone were also evaluated. The dimensionality reduction presented needs to perform a PCA only for the first I- or P-pictures in the beginning of the video shot, in the initialization phase. Thus, the new representation space is determined. Thereafter, each arriving new image in the video shot is only projected onto the determined principal axes of the eigenspace. The GLR algorithm presented is fast, working with reduced-dimensionality data and admitting an iterative implementation whereby, at time k, only an update based on the results at time k - 1 needs to be made. For simplicity, all execution times were computed with the algorithms in display mode, allowing full decoding and display of the video data, which evidently is not necessary for the three scene change detection algorithms (needing only minimal decoding of data). The execution time overhead added by the AMCD and NAMCD ranges from 5% to 7% of the execution time of the MPEG decoder, while the overhead added by the HCD is 12%. The additional processing time of the AMCD and NAMCD does not significantly affect the normal execution speed of the MPEG decoder.

VI. CONCLUSION

In this paper, we introduced a new real-time approach to video scene change detection using statistical sequential analysis theory, operating on minimally-decoded video bitstreams. The algorithms presented offer a unified approach for the detection of both abrupt and gradual scene changes in natural video sequences.

For increased efficiency, the dimensionality of the original video data is reduced using an optimal transformation that retains the most representative features of the original data, resulting in a new low-dimensional representation. The change detection algorithms operate on this transformed low-dimensional data. Scene changes are modeled as changes in parameters of a random process describing the video sequence. For video segmentation, additive and nonadditive models of scene changes are used in the context of statistical sequential analysis. No a priori assumptions are made about the duration of scene changes, or about specific models of the change (e.g., linear). Such assumptions would limit the scope of a detection algorithm, given the large variety of gradual scene changes that can be generated. In our approach, the statistical characteristics of the video data are allowed to trigger a change detection. The detection of variable-length scene changes is enabled, without the use of predetermined time windows. Through the use of sufficient statistics and recursive computation, the limitations imposed by the need to store past frames in the video shot are eliminated.

The detection algorithms presented in this paper were tested on video sequences encoded with the MPEG-2 video compression standard. Extensions for operation with different block-based transform coding methods are straightforward, e.g., other members of the MPEG family, as well as the video-conferencing standards H.261, H.263, etc.

ACKNOWLEDGMENT

The authors would like to thank Dr. V. Poor, Department of Electrical Engineering, Princeton University, who pointed us toward the extensive literature in the area of statistical change detection theory, which forms the mathematical foundation on which our approach to video segmentation is based.

REFERENCES

[1] D. Schonfeld and D. Lelescu, "VORTEX: Video retrieval and tracking from compressed multimedia databases-multiple object tracking from MPEG-2 bitstream," J. Vis. Commun. Image Represent., vol. 11, no. 2, pp. 154-182, 2000.

[2] H. J. Zhang, A. Kankanhalli, and S. W. Smoliar, "Automatic partitioning of full-motion video," Multimedia Syst., vol. 1, no. 1, pp. 10-28, 1993.

[3] R. Zabih, J. Miller, and K. Mai, "A feature-based algorithm for detecting and classifying scene breaks," in Proc. ACM Multimedia, San Francisco, CA, 1995, pp. 189-200.

[4] S. M.-H. Song, T.-H. Kwon, and W. M. Kim, "Detection of gradual scene changes for parsing of video data," in Proc. Storage and Retrieval for Still Image and Video Databases VI, vol. 3312, 1998, pp. 404-413.

[5] R. Lienhart and W. Effelsberg, "On the detection and recognition of television commercials," in Proc. Int. Conf. Multimedia Computing and Systems, Ottawa, ON, Canada, 1997, pp. 509-516.

[6] R. Lienhart, "Comparison of automatic shot boundary detection algorithms," in Proc. SPIE Conf. Storage and Retrieval for Image and Video Databases VII, San Jose, CA, 1999, pp. 290-301.

[7] U. Gargi, R. Kasturi, and S. Antani, "Performance characterization and comparison of video indexing algorithms," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Santa Barbara, CA, 1998, pp. 559-565.

[8] J. S. Boreczky and L. A. Rowe, "Comparison of video shot boundary detection techniques," in Proc. SPIE Storage and Retrieval for Image and Video Databases IV, 1996, pp. 170-176.

[9] B. Shen, D. Li, and I. K. Sethi, "Cut detection via edge extraction in compressed video," in Visual'97, San Diego, CA, 1997, pp. 149-156.

[10] T.-S. Chua, M. Kankanhalli, and Y. Lin, "A general framework for video segmentation based on temporal multi-resolution analysis," in Proc. Int. Workshop on Advanced Image Technology (IWAIT'2000), Fujisawa, Japan, 2000, pp. 119-124.

[11] V. Kobla, D. Doermann, and K. Lin, "Archiving, indexing, and retrieval of video in the compressed domain," in Proc. SPIE Conf. Multimedia Storage and Archiving Systems, 1996, pp. 78-89.

[12] J. Meng, Y. Juan, and S. Chang, "Scene change detection in a MPEG compressed video sequence," in Proc. SPIE Conf. Digital Video Compression: Algorithms and Technologies, San Jose, CA, 1995, pp. 14-25.

[13] B. Yeo and B. Liu, "Unified approach to temporal segmentation of motion JPEG and MPEG video," in Proc. Int. Conf. Multimedia Computing and Systems, 1995, pp. 2-13.

[14] V. Kobla, D. Doermann, and C. Faloutsos, "VideoTrails: Representing and visualizing structure in video sequences," in Proc. ACM Multimedia Conf., 1997, pp. 335-346.

[15] K. Shen and J. Delp, "A fast algorithm for video parsing using MPEG compressed sequences," in Proc. IEEE Conf. Image Processing, 1995, pp. 252-255.

[16] W. A. C. Fernando, C. N. Canagarajah, and D. R. Bull, "Video segmentation and classification for content based storage and retrieval using motion vectors," in Proc. SPIE Conf. Storage and Retrieval for Image and Video Databases, vol. 3656, San Jose, CA, 1999, pp. 687-698.

[17] R. M. Ford, C. Robson, D. Temple, and M. Gerlach, "Metrics for scene change detection in digital video sequences," ACM Multimedia Syst., 2000.

[18] N. Patel and I. K. Sethi, "Compressed video processing for cut detection," IEE Proc. Vision, Image and Signal Processing, vol. 143, no. 3, pp. 315-323, 1996.

[19] N. Patel and I. K. Sethi, "Video shot detection and characterization for video databases," Pattern Recognit., vol. 30, no. 3, pp. 583-592, 1997.

[20] V. Kobla, D. DeMenthon, and D. Doermann, "Special effect edit detection using VideoTrails: A comparison with existing techniques," in Proc. SPIE Conf. Storage and Retrieval for Image and Video Databases VII, San Jose, CA, 1999, pp. 302-313.

[21] B. Yeo and B. Liu, "On the extraction of DC sequence from MPEG compressed video," in Proc. IEEE Conf. Image Processing, vol. 2, 1995, pp. 260-263.

[22] H. Murakami and V. Kumar, "Efficient calculation of primary images from a set of images," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-4, no. 5, pp. 511-515, 1982.

[23] H. V. Poor, "Quickest detection with exponential penalty for delay," Ann. Statist., vol. 26, pp. 2179-2205, 1998.

[24] M. Basseville and I. V. Nikiforov, Detection of Abrupt Changes: Theory and Application. Englewood Cliffs, NJ: Prentice-Hall, 1993.

Dan Lelescu was born in Lugoj, Romania, on June 3, 1967. He received the M.S. degree in electrical engineering from the Technical University "Politehnica" Timisoara, Romania, in 1991, and the Ph.D. degree in electrical engineering and computer science from the University of Illinois at Chicago in May 2001.

In December 2000, he joined Compression Science, Inc., Campbell, CA, where he was a Senior Scientist with the Video Group. Since January 2002, he has been a researcher with DoCoMo Communications Laboratories, San Jose, CA. His research interests are in the areas of multimedia signal processing, content-based multimedia analysis and retrieval, video compression and communications, and image analysis and computer vision. He is the author of six patent applications in the areas of multimedia signal processing and video communications.

Dan Schonfeld was born in Westchester, PA, in June 1964. He received the B.S. degree in electrical engineering and computer science from the University of California, Berkeley, and the M.S. and Ph.D. degrees in electrical and computer engineering from The Johns Hopkins University, Baltimore, MD, in 1986, 1988, and 1990, respectively.

In August 1990, he joined the Department of Electrical Engineering and Computer Science at the University of Illinois at Chicago, where he is currently an Associate Professor in the Departments of Electrical and Computer Engineering, Computer Science, and Bioengineering, and Co-Director of the Multimedia Communications Laboratory (MCL) and a member of the Signal and Image Research Laboratory (SIRL).

He has authored over 60 technical papers in various journals and conferences. He has served as consultant and technical standards committee member in the areas of multimedia compression, storage, retrieval, communications, and networks. His current research interests are in multimedia communication networks; multimedia compression, storage, and retrieval; signal, image, and video processing; image analysis and computer vision; pattern recognition; and medical imaging.

Dr. Schonfeld has served as an Associate Editor for nonlinear filtering of the IEEE TRANSACTIONS ON IMAGE PROCESSING, as well as an Associate Editor for multidimensional signal processing and multimedia signal processing of the IEEE TRANSACTIONS ON SIGNAL PROCESSING. He was a member of the organizing committees of the IEEE International Conference on Image Processing and the IEEE Workshop on Nonlinear Signal and Image Processing. He was the plenary speaker at the INPT/ASME International Conference on Communications, Signals, and Systems.