


Multimedia Tools and Applications, 18, 233–247, 2002. © 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

Shot Partitioning Based Recognition of TV Commercials

JUAN M. SÁNCHEZ [email protected]
XAVIER BINEFA [email protected]
JORDI VITRIÀ [email protected]
Computer Vision Center and Departament d'Informàtica, Universitat Autònoma de Barcelona, 08193 Bellaterra, Spain

Abstract. Digital video applications exploit the intrinsic structure of video sequences. In order to obtain and represent this structure for video annotation and indexing tasks, the main initial step is automatic shot partitioning. This paper analyzes the problem of automatic TV commercials recognition, and a new algorithm for scene break detection is then introduced. The structure of each commercial is represented by the set of its key-frames, which are automatically extracted from the video stream. The particular characteristics of commercials cause commonly used shot boundary detection techniques to obtain worse results than in other video content domains. These techniques are based on individual image features or visual cues, which show significant performance shortcomings when they are applied to complex video content domains like commercials. We present a new scene break detection algorithm based on the combined analysis of edge and color features. Local motion estimation is applied to each edge in a frame, and the continuity of the color around it is then checked in the following frame. By separately considering both sides of each edge, we rely on the continuous presence of the objects and/or the background of the scene during each shot. Experimental results show that this approach outperforms single-feature algorithms in terms of precision and recall.

Keywords: digital video analysis, video structure, appearance-based recognition, video segmentation

1. Introduction

Digital video is usually seen as a linear stream of audio and visual data. However, there is a noticeable structure in it that is usually exploited by video analysis applications. At the lowest level, a video consists of a set of time-ordered frames, which are grouped following the definition of camera shot. In turn, higher-level shot aggregates are formed by logical relationships between shots, and are usually called scenes or simply logical story units (LSU). Whatever structural level we use, the main initial step will be automatic scene break detection in order to obtain the set of shots that makes up the video.

As a good example of this kind of application, this paper analyzes the problem of automatic real-time recognition of TV commercials using visual information. A solution to this problem was proposed by Lienhart et al. in [7]. We propose a different one based on the underlying shot structure of video sequences. Commercials have several characteristics that make them very challenging when it comes to applying automatic digital video analysis. Spot producers make extensive use of synthetic production effects and other techniques due to the large amount of information they want to convey to the viewer within a very strict time constraint, usually from 10 to 40 seconds. In this sense, information must be understood


from the semiotic point of view rather than in the information-theoretic sense. This kind of information is embedded in the audio and visual streams, but must be inferred by the viewer by considering previously established semantic codes. These characteristics make automatic tasks like scene break detection especially difficult. Moreover, there are two main sources of variability that affect commercials recognition:

– The length of commercials is reduced after they have been aired for a short time. Shorter spots act as reminders of their longer versions, while the broadcast cost to advertisers is reduced. Shorter versions are built from the original set of shots by removing and/or shortening some of them.

– Color intensity variations caused by the acquisition of video imagery from different sources, i.e., broadcasts from different TV stations or digitization using different video devices. These variations dramatically hurt the performance of appearance-based key-frame recognition techniques.

Our system exploits the structure of videos in order to obtain a compact representation based on their key-frames. The matching of full commercials is then defined in terms of appearance-based recognition of their key-frames. Therefore, automatic scene break detection is used to obtain the basic shot structure of spots.

Unlike other user-assisted applications that may allow detection errors, it is mandatory for the algorithm implemented in our application to have very high recall and high precision rates. Recall is the number of correct detections over the total number of actual transitions in the sequence. Precision is defined as the number of correct detections over the number of true and false positive detections. Low recall means missing a large number of actual transitions. In our case, this involves missing key-frames in the representation of commercials, so that it would be easy to find spots that cannot be correctly represented and, thus, recognized. On the other hand, low precision means having many false positive detections, i.e., the video would be oversegmented. In our application, false positive detections introduce redundancy into the key-frame representation of commercials, which may be harmful in two ways: (1) leading to wrong key-frame matches during recognition, and (2) slowing down the whole recognition process.
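The two rates reduce to simple ratios over detection counts; a minimal sketch (function and variable names are ours):

```python
def precision_recall(true_positives, false_positives, actual_transitions):
    """Precision = TP / (TP + FP); recall = TP / number of actual transitions."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / actual_transitions
    return precision, recall
```

A detector with 9 correct detections, 1 false positive and 10 actual transitions in the sequence would score 0.9 on both measures.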

Scene break detection algorithms have been developed by many authors. When working on compressed video, algorithms based on MPEG and JPEG coefficients are the least time-consuming, as the video is already compressed. A performance evaluation and comparison of different algorithms following this approach was done by Gargi et al. in [4]. However, there is a loss of generality, as uncompressed images cannot be directly processed without an additional compression cost, and the performance achieved is not as good as that of algorithms working on uncompressed imagery, as reported by Boreczky and Rowe in [1]. In the uncompressed domain, global image descriptors are usually used in order to compute a frame-to-frame difference measure. The most widely used is the image histogram due to its simplicity and fast computation speed. Histogram comparison can be done by a distance (L1, Euclidean), histogram intersection [12], a χ2 test [8], and so on. Some authors try to include spatial information from the images by partitioning them into subregions and computing separate histograms for each one [2]. This and other information is also
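The histogram comparison measures mentioned above are straightforward to compute on normalized histograms; a sketch of three of them (the eps guard is our addition):

```python
import numpy as np

def l1_distance(h1, h2):
    # L1 distance between two normalized histograms.
    return np.abs(h1 - h2).sum()

def intersection(h1, h2):
    # Histogram intersection [12]: 1.0 for identical normalized histograms.
    return np.minimum(h1, h2).sum()

def chi_square(h1, h2, eps=1e-12):
    # Chi-squared test [8]; eps avoids division by zero on empty bins.
    return 0.5 * (((h1 - h2) ** 2) / (h1 + h2 + eps)).sum()
```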


present in histogram extensions like color coherence vectors [9] and color correlograms [5].

The performance of pure color-based approaches is sufficient for many applications that do not depend on the accuracy of shot detections. However, the recall of these algorithms is too low for applications like our TV commercials recognition system. A feature-based approach was presented by Zabih et al. in [13]. They computed a measure based on the number of intersecting edge pixels between two consecutive frames after global motion compensation, which gives very good recall rates. However, considering only global translation may cause many false positive detections, as we will show later in this paper.

Boreczky and Rowe compared five algorithms in [1], considering different video content domains like TV serials, news programs, movies and commercials. We were interested in their results on the latter, taking into account the requirements imposed by our application. In general, the results reported in that work were not good enough in terms of precision and recall. Approaches based on different visual cues usually have different advantages and drawbacks as well. Therefore, it seems to be a good idea to combine the analysis of different visual cues in order to obtain a single, more effective scene break detection measure. The algorithm we present in this paper combines the analysis of edge and color features in a natural way. The results obtained following this approach show higher recall rates than single-feature detectors, and very acceptable precision. We also have the advantage that all possible sources of false positive detections are known.

The rest of the paper is organized as follows. Our TV commercials recognition system is presented in Section 2. Our new scene break detection algorithm is described in Section 3. Quantitative results are then shown and discussed, and the algorithm is compared with two single visual cue algorithms based on color and edges, respectively. The paper is finally concluded in Section 4.

2. Recognition of TV commercials

Many companies invest large amounts of money in TV publicity. These companies are interested in checking whether TV stations air their commercials correctly or not. Audio-based systems, with the limitations that spring from this approach, are being used and have been patented around the world. We have developed a system that fully relies on visual information for the recognition of previously learnt commercials. A representation of the commercials to be tracked is stored in a database, which is then used to recognize them in a TV broadcast under a real-time constraint.

The recognition of commercials can be affected by length and color intensity variations. The best way to deal with shot-level length variability is to exploit the intrinsic shot structure of videos. The extensive use of synthetic graphics and production effects in commercials can lead us to think that this structure is lost. However, the basic visual content structure is clearly present, so that the concept of shot is still valid. Furthermore, video producers usually keep the classical shot structure even if synthetic images and effects are used. Therefore, our representation of a video segment consists of partitioning it into shots and obtaining a description of their contents such that recognition techniques can be applied to them.


The visual information contained in the frames of a shot is usually well represented by a single image due to its high temporal redundancy. This representative image can contain all the information in the shot, which is the case of panoramic mosaic images [6], or just the information in the most salient frame, which is called a key-frame. Mosaicing techniques are too computationally expensive to be applied in real time, while a criterion to select the most salient frame in a sequence is difficult to define. However, the best key-frames are commonly found at fixed positions (first, last or middle) of the shot [10]. In the particular case of commercials, considering that the most salient frame of a shot is found at the beginning is supported by several observations:

– When advertisers make shortened versions of their commercials by removing frames, they usually take away frames from the end, thus relying on the viewer's memory.

– Taking the first frame costs nothing, and using more complex criteria does not guarantee better performance of the system.

– The recognition rates obtained by our system provide an empirical validation of thiscriterion.

Therefore, a commercial is represented by the set of key-frames corresponding to its shots. In this way, commercials recognition becomes a key-frame matching problem. The problem of identifying static images has been widely studied in computer vision. Computing a similarity metric based on straightforward pixel differences is too time consuming when there is a large number of key-frames in the database, due to the high dimensionality of the data.1 For this reason, a limited number of image feature descriptors is usually computed. The aim is to define a low-dimensional representation space where different images lie as far as possible from each other, given a certain distance measure. However, key-frames from commercials have such heterogeneous contents that it is very difficult to find universally good feature-based descriptors.

Semantic preserving codes were defined by Pentland et al. in [10] as compact representations that preserve essential image similarities. Principal Component Analysis (PCA) is a statistical technique that provides the directions of the principal axes of a Gaussian-distributed data set. The principal axes are those with the highest variances. If PCA is applied to all key-frames in our commercials database, we obtain a low-dimensional linear subspace spanned by the axes with the highest variances. On average, the representations of images from the initial set in this subspace are maximally distant in the Euclidean sense. Therefore, PCA is expected to automatically select those image features that are best suited for recognition.
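This construction can be sketched with the singular value decomposition, one flattened key-frame per row (the function names are ours; the paper does not prescribe an implementation):

```python
import numpy as np

def pca_subspace(keyframes, k):
    """keyframes: (n, d) matrix of flattened key-frames.
    Returns the data mean and the k principal axes as a (d, k) matrix."""
    mean = keyframes.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions,
    # ordered by decreasing variance.
    _, _, vt = np.linalg.svd(keyframes - mean, full_matrices=False)
    return mean, vt[:k].T

def project(image, mean, axes):
    # Low-dimensional representation of a (flattened) key-frame.
    return (image - mean) @ axes
```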

The main drawback of appearance-based representations, like the one obtained by PCA, is their sensitivity to slight changes of view and to color intensity variations. If we take the first frame of each shot as its key-frame, changes of view can only be caused by a lack of precision in shot boundary detections. Fortunately, most of the algorithms are precise in this sense, as they are based on computing frame-to-frame difference measures.

On the other hand, color intensity variations are a very meaningful source of variability in commercials. A color normalization step must be applied to key-frames prior to obtaining their low-dimensional representation. Several normalization algorithms have been developed, and some of them are compared in the scope of appearance-based image matching by


Figure 1. Main steps performed during the learning stage of the system: segmentation into shots, key-frame extraction, computation of the low-dimensional subspace via PCA, and projection of the key-frames in order to obtain a final compact representation of the commercials. cX,i is the projection of the i-th key-frame of commercial X into the low-dimensional subspace, and CX is the set of projections of the nX key-frames corresponding to commercial X.

Sanchez and Binefa in [11]. The grayworld approach has been shown to be suitable for pure recognition purposes. It is based on the assumption that the average color in all images is an ideal or canonical gray, following the diagonal model of color correction [3]. The scale factors for each RGB color channel are Rg/R, Gg/G and Bg/B, where (Rg, Gg, Bg) is the canonical gray RGB value and (R, G, B) is the average image color.
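A sketch of the grayworld step under the diagonal model (the canonical gray value chosen here is our assumption; the paper does not state it):

```python
import numpy as np

CANONICAL_GRAY = np.array([128.0, 128.0, 128.0])  # assumed (Rg, Gg, Bg)

def grayworld_normalize(image):
    """Diagonal-model correction: scale each RGB channel by Rg/R, Gg/G, Bg/B
    so that the average image color becomes the canonical gray."""
    avg = image.reshape(-1, 3).mean(axis=0)  # average (R, G, B)
    return np.clip(image * (CANONICAL_GRAY / avg), 0.0, 255.0)
```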

The system is divided into two stages: (1) learning and (2) recognition. During learning, the low-dimensional representation subspace is computed from the set of key-frames of the commercials to be recognized. A reduced representation of these key-frames is then obtained by projecting them into that subspace. This process is summarized in figure 1. The user defines the beginning and the end of each commercial. Fully automatic recognition is then achieved by detecting shot boundaries in the input video stream, acquiring the corresponding key-frames, obtaining their reduced representation after color normalization and looking for matching ones in the commercials database. Matching is defined by the minimum Euclidean distance in the representation subspace.
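The matching step amounts to a nearest-neighbor search in the representation subspace; a minimal sketch (the database layout is an assumption of ours):

```python
import numpy as np

def match_keyframe(query, database):
    """database maps (commercial, shot_index) -> projected key-frame vector.
    Returns the key of the nearest entry by Euclidean distance."""
    keys = list(database)
    vectors = np.stack([database[k] for k in keys])
    distances = np.linalg.norm(vectors - query, axis=1)
    return keys[int(distances.argmin())]
```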

Heuristics are introduced in the database look-up process in order to consider the sequentiality of video segments. If we already know which commercial is currently being aired because one of its shots has been identified, then the next shot will probably belong to the same commercial. If it does not, then the commercial has probably finished, and a new one may be starting. We must also consider that shots can be removed in shorter versions of our commercials, even the first one of the sequence. Therefore, the search sequence is as follows:

1. Shots from the current commercial.
2. First shot of every other commercial.
3. Every other shot of every commercial.
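The look-up order above can be sketched as a generator (a hedged sketch; all names are ours):

```python
def search_order(database_keys, current_commercial):
    """database_keys: (commercial, shot_index) pairs; yields candidates in
    the three-step order described above."""
    # 1. Shots from the commercial currently being aired.
    for key in database_keys:
        if key[0] == current_commercial:
            yield key
    # 2. First shot of every other commercial.
    for key in database_keys:
        if key[0] != current_commercial and key[1] == 0:
            yield key
    # 3. Every remaining shot of every other commercial.
    for key in database_keys:
        if key[0] != current_commercial and key[1] != 0:
            yield key
```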

The presence of monochrome frames2 in this representation of commercials can lead to ambiguities. They are used within commercial blocks in order to provide a clear gap


between different commercials, as well as to convey a sense of scene change in movies or in commercials themselves. When a commercial contains a monochrome key-frame, it might be wrongly recognized in all those situations. We solve these ambiguities using contextual information. A monochrome key-frame that belongs to a commercial will always be aired between two shots of the same commercial. Therefore, these key-frames can be removed from our representation, so that we only rely on non-monochrome ones. Monochrome frames are characterized by a very low variance of their color distribution.
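The low-variance test can be sketched as follows (the threshold value is our assumption, not the paper's):

```python
import numpy as np

def is_monochrome(frame, variance_threshold=100.0):
    """Flags frames whose color distribution has very low variance:
    mean per-channel variance below an assumed threshold."""
    return float(frame.reshape(-1, 3).var(axis=0).mean()) < variance_threshold
```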

The system has been tested on Spanish TV broadcasts acquired from different TV stations. We learnt a set of 30 commercials, with 543 key-frames in the database. There were 91 occurrences of learnt commercials during the 9-hour test sequence. Although the performance of appearance-based recognition after color normalization is not perfect,3 the heuristics introduced in the process let us recognize all occurrences of commercials in the database, even if they were learnt and recognized from different TV station broadcasts.

The main problem here lies in the performance of the underlying scene break detection algorithm used to find shot boundaries. Initially, we implemented a scene break detection algorithm based on frame-to-frame color histogram intersection. If precision has to be kept within reasonable rates, the recall rate for this algorithm is less than 0.85.4 Due to their particular complexity, we can easily come across commercials that cannot be correctly represented by their key-frames because shot boundaries are not detected. This is the case, for example, of commercials with smooth gradual transitions, and with cut boundary frames whose difference does not surpass the specified detection threshold. Many examples of such commercials can be found, and new ones will appear. Therefore, it is very difficult to quantify the number of commercials that can be properly represented using this scene break detector. On the other hand, different approaches with better recall are prone to have poor precision. In this case, oversegmentation produces an overwhelming number of redundant key-frames that will slow down and reduce the performance of the recognition process. In our commercials recognition application, we need a scene break detection algorithm with the highest possible recall and good precision.

3. Automatic shot partitioning

3.1. Edges and color

Scene break detection algorithms for uncompressed images are mainly based on a single image feature or visual cue. We have shown that each visual cue has its strengths and its weaknesses. It would be desirable to take into account different visual cues during the analysis in order to take advantage of their strengths and mutually conceal their weaknesses. Merging different measures is a difficult task. For example, snake models require the competition of internal and external energies, which must be combined using expensive minimizations and parameters that are difficult to adjust. In this section, we present a combined edge and color analysis (CECA) for shot boundary detection. As we will show, the analysis of these two visual cues is combined in a natural way and without any additional cost.

The colors around the edges of an image are a very important source of data for visual recognition. Analyzing image similarities in this way lets us capture the content of the scene,


while a certain variability over time is accepted. Regions around edges have interesting characteristics, as they can be seen as two different sub-regions, which belong to different scene elements or to different parts of them. Moreover, these sub-regions have uniform colors, so that the analysis of their color content can be interpreted with respect to the elements of the scene. Imagine an object moving over an irregularly colored and textured background. In this situation, the color of the sub-region that belongs to the background may change from one frame to the next, but the one belonging to the object will remain unchanged. Therefore, the criterion used in our algorithm in order to determine the continuity of a region around an edge is:

Criterion 1. Given a specific region defined by the surroundings of an edge, if the content of at least one of its sub-regions has not significantly changed from frame i to frame i + 1, then the probability of having continuity in the scene increases.

This criterion is checked for every region by building a color histogram of the pixels around the edge. Using a queue-based algorithm, the pixels of the sub-regions at a distance less than or equal to d from their boundary are determined, so that the color histogram is built from the same number of pixels of both sub-regions (figure 2). Suppose that we have a region around edge j in frames i and i + 1, which we call R_j^i and R_j^{i+1}, where the color in only one of its sub-regions has changed. Their associated color histograms h_j^i and h_j^{i+1} are bimodal, and their intersection is the color corresponding to the unchanged sub-region. Given that both sub-regions contribute to the histograms with the same number of pixels, this intersection comprises 50% of the volume of a full histogram (which is normalized to 1). Thus, Criterion 1 becomes true for regions R_j^i and R_j^{i+1} when:

\sum_n \min\big(h_j^i(n), h_j^{i+1}(n)\big) \geq 0.5 \quad (1)
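The test in Eq. (1) can be sketched on gray-level pixel values (the paper uses color histograms; the bin count and value range here are our assumptions):

```python
import numpy as np

def scene_continuity(pixels_i, pixels_i1, bins=8):
    """Criterion 1: the normalized histograms of the pixels around an edge
    in frames i and i+1 must keep at least half their volume in common."""
    h_i, _ = np.histogram(pixels_i, bins=bins, range=(0, 256))
    h_i1, _ = np.histogram(pixels_i1, bins=bins, range=(0, 256))
    h_i = h_i / h_i.sum()
    h_i1 = h_i1 / h_i1.sum()
    return np.minimum(h_i, h_i1).sum() >= 0.5
```

With equal pixel counts on both sides of the edge, a region where only one sub-region changed still retains exactly half of the histogram volume, so the test passes.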

3.2. Finding edge matches between two images

In order to be able to apply Criterion 1, we must find a correspondence between the edges in frames i and i + 1. Edges may have moved and changed shape due to camera operation. Global motion compensation could be applied as in [13], so that intersecting edges from the two images would be assigned to each other. However, correcting global translations

Figure 2. (a) An image, (b) its edges, (c) the pixels used to build a color histogram around a particular edge, and (d) a 2-D projection of the bimodal color histogram obtained.


Figure 3. Global motion compensation is not enough when there is a camera zoom. There is a zoom out effect between the images in (a). When global motion is compensated, their edges (b) do not intersect correctly (c). Local motion compensation (d) works well, but must be followed by a second test. In our case, color continuity is checked. The arrows show the motion estimated for each edge segment.

may not suffice when there are multiple motions in the scene, or when there is a camera zoom, as in figure 3(c). For this reason, our approach consists of breaking all edges into smaller segments and performing local edge motion estimation. In the example shown in figure 3(d), all edges are moving towards the center of the image due to the zoom out effect.

Therefore, every edge segment is located in the following frame using a correlation-based search within a neighborhood of its edge image. Given a particular edge from frame i, if there is no corresponding edge in frame i + 1, then we consider it a disappearing edge. That is, this region of the scene has changed, so that the probability of a shot boundary increases. This consideration leads us to Criterion 2, which must be applied prior to Criterion 1.

Criterion 2. Given an edge, if a corresponding one cannot be found within a neighborhood in the next frame, then the probability of continuity of the scene decreases.

Given a region R_j^i, if it does not fit Criterion 2, then it is added to a set called P^i. Otherwise, Criterion 1 is checked, and the region is added to a second set called Q^i if it fails (see figure 4).

3.3. Detecting and classifying scene breaks

These are the steps to be followed in order to detect changed regions in a frame with respect to the next one:

– Initial sets of changed regions: P^i = ∅, Q^i = ∅.
– Edge detection: threshold on the color gradient image (average of the gradient of the RGB channels).


Figure 4. Region R_j^i may become a P_j^i or a Q_j^i in the presence of a cut between frame i and frame i + 1 (a). If no R_j^{i+1} that correlates well with R_j^i is found in a neighborhood (b), it becomes a P-type region. If a feasible R_j^{i+1} is found (c), its color content is checked, and it becomes a Q-type region if it does not match.

– Edge rejection: remove small edges.
– Edge partition: divide large edges into smaller pieces.
– Region definition: define regions R_j^i, ∀j.
– FOR every j DO
    IF a feasible R_j^{i+1} is located THEN
      IF \sum_n \min(h_j^i(n), h_j^{i+1}(n)) < 0.5 THEN
        Q^i = Q^i ∪ {R_j^i}
      ENDIF
    ELSE
      P^i = P^i ∪ {R_j^i}
    ENDIF
– ENDFOR
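The classification loop above can be sketched in Python, with the matching and continuity tests passed in as callables (a hedged sketch; the names are ours):

```python
def classify_regions(regions_i, find_match, continuity):
    """regions_i: the regions R^i_j of frame i. find_match(R) returns the
    corresponding region in frame i+1 or None (Criterion 2); continuity(R, R1)
    is the histogram-intersection test of Criterion 1.
    Returns the sets P^i and Q^i of changed regions."""
    p_set, q_set = [], []
    for region in regions_i:
        match = find_match(region)
        if match is None:
            p_set.append(region)       # disappearing edge: P-type region
        elif not continuity(region, match):
            q_set.append(region)       # color changed on both sides: Q-type
    return p_set, q_set
```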

In order to obtain a global measure of the scene variation between consecutive frames, a ratio of the changed regions with respect to the total regions is computed as:

V_1(i) = \frac{\sum_{j=0}^{L-1} |P_j^i| + \sum_{j=0}^{M-1} |Q_j^i|}{\sum_{j=0}^{N-1} |R_j^i|} \quad (2)

where |x| denotes the number of pixels in region x. Each region's contribution is thus weighted by the number of pixels it contains, so that variation in large regions is more significant than in small ones.

Table 1. Contribution of P- and Q-type regions to the computation of V1 and V2. + and − stand for high and low contributions, and 0 for no contribution at all.

                    Contribution to V1     Contribution to V2
Transition effect   P-type    Q-type       P-type    Q-type
Cut                 +         +            +         +
Dissolve            −         −            −         −
Fade-out            +         0            0         0
Fade-in             0         0            +         0
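Eq. (2) reduces to a pixel-weighted ratio of changed regions; a minimal sketch (the region representation is our assumption):

```python
def variation_ratio(p_regions, q_regions, all_regions, size=len):
    """Eq. (2): pixel-weighted fraction of changed regions, where size(R)
    gives the number of pixels |R| in a region."""
    changed = sum(size(r) for r in p_regions) + sum(size(r) for r in q_regions)
    return changed / sum(size(r) for r in all_regions)
```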

Sharp and gradual transitions are detected and classified using Eq. (2). Cuts are characterized by high values of V1(i) with contributions from both the Q_j^i's and the P_j^i's, while in dissolves the variation is mainly due to the P_j^i's, because new edges appear far from the locations of old edges, as observed in [13], and color variation between consecutive frames is low. Figure 4 shows these contributions in a common cut. In a fade-out, i.e., a gradual transition of the scene into black, every R_j^i turns out to be a P_j^i because all the edges disappear, but a fade-in, which is the opposite transition, cannot be detected using V1(i). Since the edges gradually appear, we can instead first define the regions R_j^{i+1} and compute an equivalent measure V2(i) with respect to the regions in frame i.

We can take advantage of the need to compute V2(i) to make the detection more robust, using the sum of both measures (V(i) = V1(i) + V2(i)) instead of only one of them. Table 1 summarizes the different contributions of P- and Q-type regions to the computation of V1 and V2.

3.4. Experimental results

The CECA algorithm has been tested on a video sequence of commercials from a Spanish TV broadcast. The sequence is 11,800 frames long and contains different kinds of shot transitions, which are summarized in the ideal detection column of Table 2. The sequence also contains plenty of synthetic images, camera operation, multiple object motions, and so on, as commonly found in commercials. Detection results are given in Table 2 in terms of precision and recall.

We have compared the results obtained by the CECA with algorithms that rely on only one of these visual cues, either edges or color. As a color-based algorithm, we have implemented the widely used frame-to-frame color histogram difference with respect to their intersection. On the other hand, we have tested an algorithm based on the work by Zabih et al. in [13]. Our particular implementation uses the same edge detection strategy that was used in our algorithm, and then compensates global motion by finding the maximum correlation position of edge images. The number of intersecting edge pixels is computed after applying a dilation to the motion compensated image, so that small local variations are allowed.
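For reference, the color-based baseline can be sketched as follows. This is a minimal sketch, assuming RGB frames and an 8-bins-per-channel quantization; neither detail is specified in the text, and the function names are ours.

```python
import numpy as np

def histogram_difference(frame_a, frame_b, bins=8):
    """Frame-to-frame color histogram difference based on histogram
    intersection (Swain and Ballard [12]).  Frames are H x W x 3 uint8
    RGB arrays; `bins` per channel is an assumed quantization."""
    def color_hist(frame):
        h, _ = np.histogramdd(frame.reshape(-1, 3),
                              bins=(bins, bins, bins),
                              range=((0, 256),) * 3)
        return h / h.sum()                     # normalize to unit mass
    ha, hb = color_hist(frame_a), color_hist(frame_b)
    intersection = np.minimum(ha, hb).sum()    # 1.0 for identical frames
    return 1.0 - intersection                  # high values suggest a cut
```

A cut would be declared when this difference exceeds the detector's threshold (th in Table 2).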


Table 2. Comparative results of different scene break detection algorithms on an 11,800-frame video sequence.

                      Ideal   CECA   HI (th = 0.25)   HI (th = 0.3)   Edges
Cuts detected          246     246        210             202          234
Fade-ins detected       12      10          4               4            8
Fade-outs detected       9       7          3               3            5
Dissolves detected      18       9          0               0            1
False positives          0      45         67              48          203
Precision                1    0.86       0.76            0.81         0.55
Recall                   1    0.96       0.77            0.74         0.88
Recall (only cuts)       1       1       0.85            0.82         0.95

The performance of all the algorithms compared in our tests is very poor when applied to gradual transitions, especially for the simple color histogram detector. Moreover, the number of these transitions is relatively low with respect to the number of cuts in the test sequence. For these reasons, we have also considered a recall measure that only takes into account sharp transitions, and not gradual ones.
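The precision and recall figures follow the usual definitions; as a sanity check, the CECA column of Table 2 can be reproduced (up to rounding) from its raw counts. The helper below is our own, not from the paper.

```python
def precision_recall(true_detections, false_positives, total_transitions):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_detections / (true_detections + false_positives)
    recall = true_detections / total_transitions
    return precision, recall

# CECA column: 246 cuts + 10 fade-ins + 7 fade-outs + 9 dissolves detected,
# with 45 false positives, out of 246 + 12 + 9 + 18 = 285 true transitions.
p, r = precision_recall(246 + 10 + 7 + 9, 45, 285)
# p ≈ 0.86 and r ≈ 0.95, in agreement with Table 2 up to rounding.
```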

First of all, results show the great cut detection accuracy of the CECA algorithm, and a significantly better detection of gradual transitions than with single cue approaches. Fades are easier to detect than dissolves because image features completely appear or disappear. However, frame-to-frame approaches to shot boundary detection are not appropriate to deal with gradual transitions due to their extremely smooth image variations. A larger number of frames should be considered, as in the twin-threshold mechanism by Zhang et al. [14]. Even so, gradual transition detection results obtained using CECA are quite good.

On the other hand, the number of false positives is kept within reasonable values, i.e., we will not be overwhelmed by a huge number of redundant key-frames. We have noticed in our tests that false positives given by our algorithm are always due to one of these facts:

– Dramatic luminance changes. Camera flashes, explosions, and so on, not only cause sudden changes in image colors, but edge detection may also be affected, as shown in figure 5. Therefore, they also affect single feature approaches.

Figure 5. (a) Color is affected by dramatic luminance changes. (b) Edges may be affected as well.


Figure 6. Motion blur makes edge detection difficult. The CECA algorithm is affected by appearing and disappearing edges.

Figure 7. The color histograms of the images in (a) are very similar, so the cut between them is not detected using a color based algorithm. However, their edges (b) are significantly different.

– Fast and sudden motion. Large objects may appear and disappear from the scene. Motion blur can also make edge detection difficult, like in figure 6.

These are the main sources of false positives using the color histogram detector as well. However, this algorithm has low recall, considering its application to commercials recognition. When a lower threshold is used in order to obtain a better recall rate, precision gets worse, as shown in Table 2. The enhanced analysis performed by our algorithm lets us detect shot boundaries that go unnoticed using only color, like in figure 7, without being less precise.

On the other hand, the purely edge based algorithm reports a quite good recall rate, but precision is extremely low. When false positive detections are thoroughly analyzed, we notice that the algorithm is affected by camera operations other than simple translations, and by multiple motions in different directions. Figures 3(c) and 8 show that global motion compensation is not suitable in these situations. The local approach used in our algorithm can handle them properly. A local motion approach is prone to find edge matchings even in significantly different images, because the global structure of the scene is not considered. Therefore, a test that confirms them is needed, and in our case it is based on the color around them.

Figure 8. (a) The face is moving to the right and the background is going to the left. (b) Global motion compensation cannot fit both of them.
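To make the contrast with global compensation concrete, the per-edge confirmation test can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' implementation: the search radius, patch geometry, and color threshold are our assumptions, and the edge normals are taken as given.

```python
import numpy as np

# Illustrative parameters; the paper does not specify these values.
SEARCH_RADIUS = 4       # local motion search window around each edge pixel
PATCH = 2               # half-size of the color patch taken on each side
COLOR_THRESHOLD = 30.0  # max mean color difference accepted as continuous

def edge_has_continuity(frame, next_frame, next_edges, y, x, normal):
    """For one edge pixel (y, x) of frame i: look for a nearby edge in
    frame i+1 (local motion estimation), then require that the color on
    at least one side of the edge (object or background) is preserved."""
    h, w = next_edges.shape
    ny, nx = normal  # unit normal to the edge, pointing to one side

    def side_color(img, py, px, sign):
        # Mean color of a small patch offset to one side of the edge.
        cy = int(np.clip(py + sign * ny * (PATCH + 1), PATCH, h - PATCH - 1))
        cx = int(np.clip(px + sign * nx * (PATCH + 1), PATCH, w - PATCH - 1))
        return img[cy - PATCH:cy + PATCH + 1,
                   cx - PATCH:cx + PATCH + 1].mean(axis=(0, 1))

    for dy in range(-SEARCH_RADIUS, SEARCH_RADIUS + 1):
        for dx in range(-SEARCH_RADIUS, SEARCH_RADIUS + 1):
            py, px = y + dy, x + dx
            if 0 <= py < h and 0 <= px < w and next_edges[py, px]:
                # Candidate match: confirm it with color continuity on
                # either side of the edge.
                for sign in (1, -1):
                    diff = np.abs(side_color(frame, y, x, sign)
                                  - side_color(next_frame, py, px, sign))
                    if diff.mean() < COLOR_THRESHOLD:
                        return True
    return False  # no confirmed match: this edge pixel votes "changed"
```

An edge pixel with no confirmed match would then contribute to a changed (P-type or Q-type) region.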

Regarding its application to our TV commercials recognition system, higher recall in the scene break detector provides higher recall in commercials recognition as well: if more true transitions are correctly detected, fewer commercials will be missed. On the other hand, better precision of shot segmentation provides higher precision when recognizing commercials as well: when many false positive key-frames appear, the probability of wrong recognitions increases. In both senses, the CECA algorithm improves the performance of the commercials recognition process.

4. Conclusions

Despite its linear nature, digital video applications often try to deal with high-level video structures like commercials or news items. These structures can be characterized and recognized from the basic shot structure of video sequences by exploiting prior knowledge about the particular video content domain. Specifically, we have shown that commercials can be recognized from the sets of their key-frames using an appearance-based representation. Key-frames are extracted from shots after an automatic shot boundary detection process, so that the capabilities of the commercials recognition system strongly depend on the performance of the algorithm used.

New trends in video production tend to make an extensive use of synthetic edit processes, which make automatic digital video analysis much more difficult. Commercials are the best example, but other video content domains, like news, are approaching this production model, considering their own particular characteristics. In order to exploit the structure of video under these constraints, reliable shot boundary detectors are required. Common approaches applied to the uncompressed video domain rely on a single image feature or visual cue for detecting scene breaks. Our experimental results show that each one has its own weaknesses that make them inappropriate. The main contribution of this paper is a scene break detection algorithm that combines the analysis of edge and color features in a natural way, so that we take advantage of their strengths and they mutually compensate for each other's weaknesses. Local motion estimation is applied to the edges of each frame, and then the continuity of the color around them in the next frame is checked. Experimental results show a very high recall rate with good precision. The sources of false positives are limited to dramatic luminance changes and fast sudden motion. Scene break detection approaches based on single visual cues are clearly outperformed by the combined analysis of edges and color.

Acknowledgments

This work was funded by CICYT grants TEL99-1206-C02-02, TAP98-0631 and TIC98-1100.


Notes

1. An RGB color image of size 160 × 120 pixels can be expressed as a vector in a 57,600-dimensional space.
2. These frames are usually black.
3. See Sanchez and Binefa [11] for an extensive evaluation of this technique.
4. Results shown in Table 2 will be discussed later in this paper.

References

1. J.S. Boreczky and L.A. Rowe, "Comparison of video shot boundary detection techniques," Journal of Electronic Imaging, Vol. 5, No. 2, pp. 122–128, 1996.

2. C. Colombo, A. Del Bimbo, and P. Pala, "Retrieval of commercials by video semantics," in Proc. Computer Vision and Pattern Recognition, 1998, pp. 572–577.

3. G. Finlayson, M. Drew, and B. Funt, "Colour constancy: Generalized diagonal transforms suffice," Journal of the Optical Society of America A, Vol. 11, No. 11, pp. 3011–3020, 1994.

4. U. Gargi, R. Kasturi, and S. Antani, "Performance characterization and comparison of video indexing algorithms," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA, 1998, pp. 559–565.

5. J. Huang, S.R. Kumar, M. Mitra, W.-J. Zhu, and R. Zabih, "Image indexing using color correlograms," in Proc. IEEE Computer Vision and Pattern Recognition Conference, CVPR'97, San Juan, Puerto Rico, 1997.

6. M. Irani and P. Anandan, "Video indexing based on mosaic representations," Proceedings of the IEEE, 1998.

7. R. Lienhart, C. Kuhmunch, and W. Effelsberg, "On the detection and recognition of television commercials," in Proc. IEEE Conf. on Multimedia Computing and Systems, Ottawa, Canada, 1997, pp. 509–516.

8. A. Nagasaka and Y. Tanaka, "Automatic video indexing and full-video search for object appearances," in Visual Database Systems II, E. Knuth and L. Wegner (Eds.), Elsevier Science Publishers, 1992, pp. 113–127.

9. G. Pass and R. Zabih, "Histogram refinement for content-based image retrieval," in Proc. 3rd Workshop on Applications of Computer Vision, Sarasota, Florida, 1996.

10. A. Pentland, R.W. Picard, and S. Sclaroff, "Photobook: Content-based manipulation of image databases," in SPIE Storage and Retrieval for Image and Video Databases II, Vol. 2185, San Jose, CA, 1994.

11. J.M. Sanchez and X. Binefa, "Color normalization for appearance based recognition of video key-frames," in Proc. International Conference on Pattern Recognition, Barcelona, Spain, 2000, Vol. 1, pp. 815–818.

12. M.J. Swain and D.H. Ballard, "Color indexing," International Journal of Computer Vision, Vol. 7, No. 1, pp. 11–32, 1991.

13. R. Zabih, J. Miller, and K. Mai, "A feature-based algorithm for detecting and classifying scene breaks," in ACM Conference on Multimedia, San Francisco, California, 1995.

14. H.J. Zhang, A. Kankanhalli, and S. Smoliar, "Automatic partitioning of video," Multimedia Systems, Vol. 1, No. 1, pp. 10–28, 1993.

Juan M. Sanchez is a Ph.D. student in the Computer Science program, Universitat Autonoma de Barcelona, Spain. He received his B.Sc. degree in Computer Engineering from the Universitat Autonoma de Barcelona in 1998, and his M.Sc. degree in Computer Vision from the Computer Vision Center, Barcelona, in 1999. His main research interests are semantic video content representations and retrieval.

Xavier Binefa received the Ph.D. degree in Computer Science from the Universitat Autonoma de Barcelona in 1996, becoming an associate professor at that university in 1997. His present research interests are in computer vision methods for video indexing and retrieval.

Jordi Vitria received the Ph.D. degree from Universitat Autonoma de Barcelona (UAB), Barcelona, Spain, for his work in mathematical morphology, in 1990. He joined the Computer Science Department of the UAB, where he became an Associate Professor in 1991. His research has focused on developing learning systems for object recognition. He has also conducted research in medical image analysis and pattern recognition. Currently, he is performing research at the UAB Computer Vision Center, where he is Director of the Masters in Computer Vision Program.