

Information Systems 27 (2002) 559–586

On-line knowledge- and rule-based video classification system for video indexing and dissemination

Wensheng Zhou a,b,*, Son Dao a, C.-C. Jay Kuo b

a Information Science Laboratory, HRL Laboratories LLC, Malibu, CA 90265-4799, USA
b Department of EE—Systems, University of Southern California, Los Angeles, CA 90089, USA

*Corresponding author. Information Science Laboratory, HRL Laboratories LLC, Malibu, CA 90265-4799, USA. Tel.: +1-310-317-5278. E-mail addresses: [email protected] (W. Zhou), [email protected] (S. Dao), [email protected] (C.-C. Jay Kuo).

Abstract

Current information and communication technologies provide the infrastructure to transport bits anywhere, but do not indicate how to easily and precisely access and/or route information at the semantic level. To facilitate intelligent access to the rich multimedia data over the Internet, we develop an on-line knowledge- and rule-based video classification system that supports automatic "indexing" and "filtering" based on the semantic concept hierarchy. This paper investigates the use of video and audio content analysis, feature extraction and clustering techniques for further video semantic concept classification. A supervised rule-based video classification system is proposed, using automatic video segmentation, annotation and summarization techniques for seamless information browsing and updating. In the proposed system, a real-time scene-change detection proxy performs an initial video-structuring process by splitting a video clip into scenes. Motional, visual and audio features are extracted in real time for every detected scene by on-line feature-extraction proxies. Higher semantics are then derived through a joint use of the low-level features and classification rules in the knowledge base. Classification rules are derived through a supervised learning process that relies on representative samples from each semantic category. An indexing and filtering process can then be built on the semantic concept hierarchy to personalize multimedia data based on users' interests. In real-time filtering, multiple video streams are blocked, combined, or sent to certain channels depending on whether or not the video streams match the user's profile. We have extensively experimented with and evaluated the classification and filtering techniques using basketball sports video data. In particular, in our experiment, the basketball video structure is examined and categorized into different classes according to distinct motional, visual and audio characteristic features by a rule-based classifier. The concept hierarchy describing the motional/visual/audio feature descriptors and their statistical relationships is reported in this paper, along with detailed experimental results using on-line sports videos. © 2002 Published by Elsevier Science Ltd.

Keywords: Video indexing; Video semantic content; Video classification; Feature extraction; Internet video access; Decision tree; Rule-based reasoning; Knowledge-based systems; Multicast video; Internet video; Video summarization; Video annotation; Video semantics inference; Video content filtering

1. Introduction

New integrated multimedia services are emerging from the rapid technological advances in networking, multi-agents, media and broadcasting technologies. This development allows large amounts of multimedia information to be distributed and shared on the Internet. Current information and communication technologies provide the infrastructure to transport bits anywhere, but they do not presume to handle information at the semantic level, due to insufficient indexing mechanisms and the lack of good automated semantic extraction and interpretation mechanisms. Consequently, the huge amounts of multimedia data impose a heavy burden of data manipulation on people, including searching/choosing, interpreting, skimming, and integrating information.

Smart Push and active Pull applications, such as user agent-driven media selection and filtering, personalized television services and intelligent media presentation, follow a paradigm more akin to broadcasting and thereby influence the emerging pattern of multicasting over the Internet. Such applications require the ability to analyze and index contents on-the-fly rather than by the normal store-and-analyze-later paradigm. Therefore, a distributed proxy architecture coupled with real-time indexing and content analysis techniques needs to be developed to support these requirements.

Traditional content-based video retrieval development focuses on the use of low-level features such as color, motion and texture to index video. However, while direct application of generic similarity metrics to low-level features can give good results in cases with approximately similar "patterns", their application to discerning similar semantic classes is highly suspect. This is partly because the effective joint combination of multiple low-level features is a very difficult problem: it is hard to create generic models for multiple types or applications of video at once. Thus, efficient video classification into semantically meaningful classes requires more supervised approaches; besides, a static classification model is application dependent and, as a result, may not be suitable for on-line applications, because real-time live streams tend to be very versatile. The difference between our work and the existing accomplishments in the literature is that, while most of them use a static model for video classification to provide semantic indexing of off-line multimedia databases, we have taken an approach using a supervised learning technique to form a classification system and applied it specifically to basketball video event indexing as an experimental example. Moreover, an effective and real-time multimedia data-sharing, filtering, and dissemination infrastructure is proposed using Internet multicast protocols.

More specifically, we apply an inductive decision-tree learning method to arrive at a set of if–then rules that can be directly applied to a set of low-level-feature-matching functions. This decision tree is our trained knowledge and forms the on-line classifier. This knowledge-based representation approach is especially useful for on-line video semantic classification, indexing and filtering because it provides powerful rules that can be easily associated directly with the characteristic features of each class. Moreover, the rules show the order and priority of each low-level feature in the classifier when multiple features are present, which is very important for fast on-line video understanding and indexing. The proposed system also includes approximate accuracy and confidence measures for each classification output. Using this system, we have experimented with and classified basketball video data into different categories such as left fastbreak, right fastbreak, left dunk-like events, right dunk-like events, close-up video sequences and so on. Such classification of high-level semantics can be used to answer queries such as "show all dunk-like shots where team A scored", as well as to support smart browsing of basketball games.

The rest of the paper is organized as follows. Section 2 gives a brief description of related work; the background and motivation of this research are also discussed in that section. The system architecture and the video data model are described in Section 3, which also details the classification-rule learning by the decision-tree learning algorithm and the on-line knowledge-based classification system. That section also describes knowledge creation, video high-level semantic content analysis, and query/filtering procedures based on the knowledge information stored in the knowledge base. Section 4 describes the implementation of on-line content analysis agents based on low-level feature-extraction processes. Section 5 gives the experimental results for our classifier and video applications over the Internet; the proposed system is evaluated with an application example of on-line basketball event indexing and dissemination by filtering. Section 6 concludes the paper.

2. Background and related work

2.1. Background and motivation

Generally, the first step in video analysis is scene and shot boundary detection, which parses video into a collection of scenes and shots. Each scene can be represented by a sequence of shots, and the shots can be summarized by a few key frames (see Fig. 1(a)). The key frames can be further summarized according to low-level features such as color, shape, and motion. Based on this feature-based analysis, it is possible to perform effective video parsing to support video summarization, fast browsing and low-level content indexing.

On the other hand, video programs can be divided into stories at the semantic level [1], which are contained at every level of video streams; for example, video semantic contents are expressed and represented by video scenes, shots and key frames (see Fig. 1(b) and (c)). Low-level features, such as key frames and objects, are widely used for content-based retrieval and are relatively easy to extract automatically. Indeed, up to the present, video data are still being annotated manually or semi-manually with keywords in most applications. Automatic video semantic content analysis and extraction remains a very challenging research topic. Effective video classification can automatically group visual/audio multimedia into a certain level of semantic concepts. In addition, a flexible knowledge representation scheme, coupled with reasoning and learning capabilities to bridge the gap between video low-level features and high-level semantic concepts, would facilitate semantic-level query and filtering for on-line video dissemination.

2.2. Issues in on-line and off-line video analysis

We distinguish between on-line and off-line video content analysis because the issues are considerably different in the two cases.

Fig. 1. A framework of video parsing and description: (a) Video parsing and segmentation, (b) Hierarchical video representation (a description scheme), and (c) Description layers.


Automatic video content analysis and annotations are generated by processing either the raw media data directly or lower-level feature annotations. Some examples of video content analysis and annotation are topic keyword summarization or key frame summarization of videos, video scene-change tags that break video into visually meaningful units, concept identifiers of a given collaboration session, etc.

There are some unique characteristics of on-line video analysis compared with the off-line case. First, it is necessary to take into account the fact that video data, by nature, tend to be relatively large in size compared to other types of data. This means that real-time processing of video data should rely on low-complexity techniques, such as fast image processing, so as to facilitate efficient content extraction. In other words, to avoid generating large latencies caused by content analysis, we may need to trade off the complexity of the content analysis algorithm for increased speed, e.g. use a simpler binary classification.

Second, semantic extraction directly from visual data has traditionally been very difficult. Content-based (or feature-based) video retrieval is not efficient due to the lack of a comprehensive data model that captures the structured abstractions and knowledge needed for concept-based video retrieval. On the other hand, pixel-matching or feature-similarity matching methods employed by query-by-example techniques are time consuming, and have limited practical use since little of the video object semantics is explicitly modeled.

Third, not all video content annotations are generated by directly processing raw data, because the annotations that are directly generated from raw data are very low level and not directly capable of aiding high-level decisions. Typically, further processing has to be done on certain metadata (or annotations) to generate the higher-level annotations which are better suited for aiding high-level decisions. For example, video key frames can be used for fast browsing, but key frames alone are still very inefficient for semantic interpretation of video, since they are raw images, too. To get the semantic meaning expressed in each key frame, visual features should first be analyzed, and then pattern classification conducted based on these extracted visual features.

Another unique characteristic is that many visual and motional features in video are based on multiple attributes, or on other feature-extraction operations, so cooperation and synchronization techniques are necessary. Moreover, on-line video normally covers a huge range of topics, which complicates the problem, so we may need to learn what is of interest to the user from the user's profile, and then establish the knowledge base for that category of video. For example, if we know s/he is a CNN fan, we may establish a CNN news knowledge base off-line based on the characteristics of CNN news frames, such as the CNN logo, a model of the spatial structure of anchor-person shots, the station background when the anchor-person is talking, etc. We may then use these characteristics to differentiate CNN news video from other videos. Therefore, we see that in on-line video key frame and sequence classification, prior domain knowledge and learning are very important. In general, how to establish the base classes and feature basis for a knowledge base from scratch within a reasonable time is one of the most important research issues for on-line video classification. We will address some of the above-mentioned issues in this paper.

Apart from the on-line mode, video content analysis and annotations are also utilized in the off-line mode, primarily as a means to aid access to the raw data through sophisticated queries such as those based on semantic content. There are some differences in the raw data processing used to generate these off-line annotations compared to on-line annotations, and one of the biggest differences is that the former does not have a real-time constraint in creating the annotations. Thus, the off-line annotations may be more sophisticated than those generated using on-line processing. Off-line generation of annotations allows full parsing of a given video record to generate context, which is not possible with on-line techniques. Moreover, there is the issue of indexing records so as to allow fast access to a set of records, which is again not a consideration in on-line processing. In general, on-line processing to generate annotations has to consider real-time aspects, and the annotations are used to aid filtering, whereas in the off-line mode the annotations are primarily used to allow retrieval from a database at a later stage, and metadata extraction does not have to be concerned with real-time aspects.

2.3. Related work

Today's video database community widely uses low-level features, such as color, motion and texture, to index videos. Thus, many existing video database management systems support content-based queries based on low-level features. To name a few, Chang et al. [2] used visual cues to facilitate video retrieval, and Deng et al. [3,4] proposed an object-based video representation to facilitate queries on objects. However, low-level features are generally high-dimensional and do not directly map to semantic classes, so these methods are still not convenient and efficient enough to support many applications, such as on-line video data searching and filtering based on a user's requests and profiles.

Besides indexing and retrieval with low-level features, researchers [5–7] have also studied video classification based on low-level features. Efficient video classification can bridge the gap between a video's low-level features and its high-level semantic features, and thus facilitate the indexing of video databases, video summarization and on-line video filtering based on the user's profile. For example, Jaimes and Chang [7] used an interactive learning algorithm over a semantic data model. In this system, users specify the classes and provide examples for learning, and the learning algorithm is used to train the classifiers. However, the interactive mode is difficult to apply to on-line applications. Picard [8,9] discussed in detail other data models for video and image libraries, which are mainly based on the classification of digital images. This system only applies to off-line content-based video database retrieval. As for other applications, such as real-time Internet video streaming and on-line video indexing and filtering, generic models for all these videos are hard to establish; they generally require fast and efficient content analysis and efficient semantic classification. Besides, a specific data model is not effective for generic purposes because of the varied nature of the data. As a result, a generic, fast and efficient framework for on-line video classification still needs to be developed.

Many researchers have also worked on various sports video classification problems. For example, Saur et al. [10] worked on automated analysis and annotation of basketball videos, and they mainly used heuristics of basketball structure to guide the classification. Gong et al. [11] and Sudhir et al. [12] both used model-based classification methods. Similarly, most of these video classification systems are feature-based and designed for off-line applications, and few of them study the relationships between low-level features and high-level conceptual meanings (semantics). In an off-line video classification environment, classification results can be obtained based on all available low-level features. However, this practice is usually not practical for real-time on-line video classification. Hence a fast video classification system based on certain simple, easily extracted low-level features is essential. It would be even better if the system could guide feature extraction to avoid extracting unnecessary features and thus save valuable computing power and reduce latency for on-line video content analysis.

Knowledge-based techniques [13] are widely used in the development of vision systems, such as image and video segmentation [14]. Knowledge-based systems, which are also known as expert systems, have traditionally been used for the high-level interpretation of images. They incorporate mechanisms for spatial and temporal reasoning that are characteristic of intermediate and high-level image understanding. However, knowledge-based systems are usually developed for specific applications, to maintain their efficiency. In our work, we are interested in developing a knowledge-based system for fast on-line video classification and filtering applications; a generic framework and methodology is established first. We use supervised learning to establish any specific knowledge base, while maintaining off-line learning/training as the method to establish a knowledge base for new video types.


We then use basketball video classification and filtering as a particular example to illustrate the proposed concepts.

In our system, the knowledge base consists of a predefined semantic class tree and off-line trained rules that define the if–then rules for each concept class in terms of low-level feature descriptors. Once the rules are learned, fast on-line video classification becomes possible and efficient. We do not aim to provide a classifier for the semantic meaning of arbitrary video; rather, for a given type of video, such as the video type specified in a user's profile list, we provide a generic method to classify on-line video sequences into semantic units according to rules learned under supervision. Some semantic video sequences might share the same features with similar feature patterns; our goal is a generic way to find the most discernible features between any two ordered semantic meanings in order to distinguish the two, keeping priority and order in mind.

3. Knowledge-based system for video classification

In this section, we introduce and describe a layered video analysis model for video semantic content analysis and conceptual classification. We then describe innovative tools for constructing the knowledge-based system with flexible rules, covering both semantic content extraction using supervised learning techniques and ad hoc queries and filtering based on users' requests.

3.1. Layers of video analysis model for video semantic classification

To satisfy both on-line and off-line video analysis requirements and to remedy the shortcomings of traditional content-based database management techniques, semantic inference (classification) and reasoning about conceptual meanings based on low-level perceptual features should be explored in detail. An example would be detecting all video clips containing dunk-like action in a basketball video by exploiting fast or slow movement patterns. To achieve this goal, we describe a generic layered video analysis model as depicted in Fig. 2.

The layered video model consists of: (1) the raw data layer, (2) the video segmentation layer, (3) the perceptual feature content layer, (4) the conceptual content layer and (5) the knowledge layer. Each layer is mapped to a processing module in the on-line video analysis and dissemination infrastructure, which is discussed in Section 5. The function of each layer is detailed below:

* The raw data layer: This layer contains the original video data, either stored in the video database or received from on-line media streams such as video, audio and caption text from servers over the Internet. It is an abstraction of the coded media sources, including various formats of audio, video, and caption information. When media is queried or matched with a user's profile, audio, video and caption data are multiplexed and/or synchronized in transmission and then presented to users.

* The video segmentation layer: Video is a continuous and unstructured medium. To understand any content of a video, or to analyze any video data, an efficient mechanism is required to parse the video into smaller units based on certain criteria, and the segmented units should be suitable either for perceptual feature analysis or for conceptual abstraction analysis. For example, a video is decomposed into a series of small units with similar visual patterns so that perceptual features, such as colors extracted from a key frame, can be analyzed effectively and efficiently, while motion patterns need to be extracted from video shots or scenes. As for the conceptual meaning of video, we need to decompose the video into units which correspond to conceptually meaningful abstracts of the video content based on the conceptual model. Our system contains several levels of video content analysis proxies working on various levels of hierarchical video decomposition, which will be discussed in detail in Section 4.

* The perceptual feature content layer: Annotation tags are represented in this layer. The annotations are generated by processing either the raw data directly or lower-level annotations. Thus, the perceptual feature content layer contains all the low-level video features extracted from the raw data, including physical video parsing tags and feature descriptors such as color, texture and motion. The extraction methods for the generic low-level features used for system evaluation with the basketball video are described in detail in Section 4. Correspondingly, our system contains several levels of low-level video analysis and extraction proxies working on various media sources, including video, audio and text.

* The conceptual content layer: Visual data in a video clip contain rich and unstructured information. The video conceptual content layer is an abstraction of the various semantic types of visual data and their structures. In our system, the video conceptual content is mapped to the user's and/or application's model, which the query engine and user's profile follow. For example, in basketball video, most people may query certain key game events, such as scores and dunks. The conceptual content layer for the basketball video is defined to have nine events, as given in Section 5; the conceptual content of the visual data can be further expanded as needed, depending upon the user's interests and requests.

* The knowledge layer: The knowledge layer contains rules that map the low-level features of each video clip in the perceptual feature content layer into classes of the conceptual content layer. The knowledge rules are automatically derived from off-line learning algorithms and constructed as a tree structure. The feature attributes used for video classification are general and insensitive to the context. The build-up of knowledge is described in detail in Section 3.2. Visual entities in the perceptual feature content layer are linked to the knowledge tree in the knowledge layer to provide preset values for conceptual terms. The knowledge is used for concept inferencing to support further data dissemination decision-making. Moreover, the query engine can use the inference engine to automatically generate feature compositions for content-based retrieval with low-level features. High-level semantic content analysis proxies, such as the video feature clustering proxy and the video feature classification proxy, analyze and identify media concepts (such as the subject of a news video or the topic of a story) exhibited by the data streams.

Fig. 2. Layers in the video analysis model.

The above conceptual layers are derived by novel feature extraction and content analysis techniques, and are used for on-line media stream filtering and for ad hoc querying based on user interests. Archiving and further off-line analysis of the data can also be performed to provide additional semantic structures for subsequent retrievals.

3.2. Building the knowledge base

In order to classify incoming video streams into meaningful semantic classes, we should classify video with a small number of features that can be easily extracted. Fig. 3 illustrates the overall system for knowledge-base building and the on-line video classification process. It consists of two steps. First, it performs the off-line training: sample video clips of different categories are identified, and appropriate low-level features are created. Second, we utilize an entropy-based inductive tree-learning algorithm [15] to establish the trained knowledge base. Rules are learned by training, which includes determining the characteristic features of each class. The classification rules for each class contain the functions of certain feature descriptors and their corresponding threshold values, order and weight. Off-line training can also update the existing knowledge, if necessary. Once we have the classification rules for each class, they are used to build and guide on-line feature extraction in response to the filtering specified by users, i.e. to choose the appropriate operators to obtain the target feature descriptors and to output a binary classification result (yes/no) against the user's specification. The rules can specify the order, the value and the priority of each feature test. The above process is described in detail below.

Fig. 3. Knowledge-based video classification system.

3.2.1. Rules for the knowledge base

Our knowledge base is represented as a decision tree, as shown in Fig. 4, where each node in the tree is an if–then rule applied to a similarity metric based on appropriate low-level features along with well-derived thresholds. The rule is depicted as $f = \langle F, \theta \rangle$, where $F$ denotes the appropriate feature and $\theta$ denotes the threshold, which is automatically created during the training process. Semantic categories form the leaves of the decision tree. Each node in the tree is either a leaf or an intermediate node with two children. A set of videos $V_i$ is associated with each node $N_i$, while a decision rule $f_i$ is associated with each intermediate node.

An example of a rule-based tree with nine nodes is shown in Fig. 4. The entire set of videos is associated with the root. Let $N_i$ be an intermediate node, with its children labeled $N_{i1}$ and $N_{i2}$. Then the video subsets $V_i$, $V_{i1}$, $V_{i2}$ satisfy

$V_i = V_{i1} \cup V_{i2}, \quad V_{i1} \cap V_{i2} = \emptyset.$   (3.1)

In other words, the leaves form a partition of the database into disjoint subsets. In the example above, the sets $V_3$, $V_5$, $V_6$, $V_7$ and $V_8$ are disjoint, and their union is $V_0$. Without loss of generality, it is assumed that the decision rule $f_i$ is a discriminant function with the following interpretation. Let $x$ denote a video clip in $V_i$. If $f_i(x, F_i) \le \theta_i$ then $x \in V_{i1}$; if $f_i(x, F_i) > \theta_i$ then $x \in V_{i2}$. When these discriminant functions represent important visual characteristics, a visual-content rule tree partitions the set of videos into distinct clusters with feature-similar video clips in each cluster. In the example above, all videos in $V_7$ have the following properties in common:

$f_0(x, F_0) \le \theta_0, \quad f_1(x, F_1) > \theta_1, \quad f_4(x, F_4) \le \theta_4.$   (3.2)

They should look different from the video set in $V_2$ if the characterization according to the rule $f_0(x, F_0)$ is visually meaningful. In other words, we need at most three rules to classify a video into class $N_7$.

Fig. 4. Illustration of rule-based tree for classification.
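To make this structure concrete, the following Python sketch (our illustration, not the authors' implementation) models an intermediate node as a rule $f = \langle F, \theta \rangle$ with two children and a leaf as a semantic class; the feature names, thresholds and class labels are hypothetical placeholders rather than values taken from the paper.

```python
class RuleNode:
    def __init__(self, feature=None, theta=None, left=None, right=None, label=None):
        self.feature = feature   # F: feature descriptor tested at this node
        self.theta = theta       # threshold learned during off-line training
        self.left = left         # child taken when f(x, F) <= theta
        self.right = right       # child taken when f(x, F) >  theta
        self.label = label       # semantic class; set only at leaves

    def is_leaf(self):
        return self.label is not None

def leaf_classes(node):
    """Leaves form a disjoint partition of the video set, as in Eq. (3.1)."""
    if node.is_leaf():
        return [node.label]
    return leaf_classes(node.left) + leaf_classes(node.right)

# Hypothetical two-level tree: motion magnitude at the root, then dominant
# motion direction; the classes echo the basketball events used later.
tree = RuleNode("motion_magnitude", 0.4,
                left=RuleNode(label="close-up"),
                right=RuleNode("motion_direction", 0.0,
                               left=RuleNode(label="left fastbreak"),
                               right=RuleNode(label="right fastbreak")))

print(leaf_classes(tree))  # ['close-up', 'left fastbreak', 'right fastbreak']
```

Collecting the leaf labels illustrates the partition property of Eq. (3.1): every clip routed down the tree ends up in exactly one leaf class.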

3.2.2. Supervised decision-tree learning and rule extraction

The classification scheme uses the rule-based system to build a binary tree and associates tree nodes with subclasses of videos. This enables fast video classification. The key computational step is to create the two children of a node so that their associated classes are clustered with respect to a meaningful visual characteristic. This implies that the video subsets associated with each of the two child nodes are more alike with respect to the visual property than the video clips associated with the parent node. If the tree is deep enough, its leaves should correspond to clusters of similar video events, and a query by visual content or a classification based on visual features should return one or more of these video clusters.

The decision tree [16] is one of the most widely used and practical methods for generating rules in inductive inference. Given a collection $S$ that contains a total of $c$ categories of some target concept, the entropy of $S$ relative to this $c$-wise classification is defined as

$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i,$   (3.3)

where $p_i$ is the proportion of $S$ belonging to class $i$. The entropy is equal to 0, the minimum, when all cases in a set belong to the same class; it reaches its maximum (1 in the two-class case) when the classes are equally distributed in the given set. The information gain is simply the expected reduction in entropy caused by partitioning the examples according to an attribute. More precisely, the information gain $\mathrm{Gain}(S, A)$ of an attribute $A$ relative to a collection of examples $S$ is defined as

$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v),$   (3.4)

where $\mathrm{Values}(A)$ is the set of all possible values for attribute $A$, and $S_v$ is the subset of $S$ for which attribute $A$ has value $v$ (i.e. $S_v = \{ s \in S \mid A(s) = v \}$). Note that the first term in Eq. (3.4) is just the entropy of the original collection $S$, and the second term is the expected value of the entropy after $S$ is partitioned using attribute $A$. $\mathrm{Gain}(S, A)$ is the information provided about the target function value, given the value of some other attribute $A$; it is the number of bits saved when encoding the target value of an arbitrary member of $S$ by knowing the value of attribute $A$. To determine the first attribute to be tested in the tree, the algorithm computes the information gain for each candidate attribute (all feature attributes used in the training process) and then selects the one with the highest information gain. The process of selecting a new attribute and partitioning the training examples is then repeated for each non-terminal descendant node, using only the training examples associated with that node. Attributes that have been incorporated higher in the tree are excluded, so that any given attribute can appear at most once along any path through the tree. This process continues for each new leaf node until either of two conditions is met: (1) every attribute has already been included along this path through the tree, or (2) the training examples associated with this leaf node all have the same target attribute value (i.e. their entropy is zero). In summary, the decision-tree learning procedure has the following steps:

* At each step, we split the tree on the variable that maximizes the entropy gain. If $S$ contains one or more tuples that all belong to class $C_i$, the decision tree for $S$ is a leaf identifying class $C_i$; stop.

* Otherwise, $S$ contains tuples with mixed classifications. We split $S$ into $S_1, S_2, \ldots, S_m$ whose members "tend to belong" to the same class. The split is executed according to the possible outcomes $\{O_1, O_2, \ldots, O_m\}$ of a certain feature attribute $r_k$: thus $S_i$ contains all tuples $r$ in $S$ such that $r_k = O_i$. In this case, the tree for $S$ is a node with $m$ children. The node is labeled with feature attribute $r_k$ and function $f = \langle r_k, O_i \rangle$.

* Perform the above two steps recursively for each $S_1, S_2, \ldots, S_m$.

The whole tree induction depends upon the split criterion and the stop criterion. A good tree should have few levels, as it is better to classify with as few decisions as possible; in addition, a good tree should have a large leaf population, as it is better to have leaves with as many cases as possible. In the learning algorithm, we split the training data on the variable that maximizes $\mathrm{Gain}(S, A)$; here, the "gain" value is based only on the class distribution, which makes the computation easy to perform, and the entropy gain measure does not take popularity into consideration. If the stop criterion is that the entropy reaches 0 (the same class for all cases), the tree will over-fit and become very deep, with few cases on the leaf nodes, which is undesirable. So it is possible to choose the stop criterion either as a minimum population allowed for the leaves, or as a certain entropy value to be reached, or, even better, by combining the two conditions. Usually the algorithm will over-fit the tree first and then prune it.
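As an illustration of the procedure above, the following sketch implements the entropy of Eq. (3.3), the information gain of Eq. (3.4), and a greedy top-down induction loop with the two stop criteria just discussed (a minimum leaf population and a target entropy). It assumes discrete-valued attributes stored as dictionaries, and it omits C4.5's handling of continuous attributes, missing values and pruning, so it is a simplification rather than the system actually used; the tiny training set at the end is made up for demonstration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i log2 p_i over the class distribution, Eq. (3.3)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) Entropy(S_v), Eq. (3.4)."""
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in by_value.values())

def induce(rows, labels, attrs, min_leaf=2, max_entropy=0.0):
    """Greedy top-down induction with the two stop criteria discussed above:
    a minimum leaf population and a target entropy value."""
    if entropy(labels) <= max_entropy or len(labels) <= min_leaf or not attrs:
        return Counter(labels).most_common(1)[0][0]          # leaf: majority class
    best = max(attrs, key=lambda a: gain(rows, labels, a))   # split on maximum gain
    node = {"attr": best, "children": {}}
    for value in set(row[best] for row in rows):
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        node["children"][value] = induce(list(sub_rows), list(sub_labels),
                                         [a for a in attrs if a != best],
                                         min_leaf, max_entropy)
    return node

# Tiny hypothetical training set (feature values and labels are made up):
rows = [{"motion": "high", "whistle": "yes"}, {"motion": "high", "whistle": "no"},
        {"motion": "low", "whistle": "no"}]
labels = ["fastbreak", "fastbreak", "close-up"]
print(induce(rows, labels, ["motion", "whistle"], min_leaf=1))
```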

In practice, one quite successful method for finding high-accuracy hypotheses is a technique called rule post-pruning. A variant of this pruning method is used by C4.5 [15]. Rule post-pruning involves the following steps:

1. Infer the decision tree from the training set, growing the tree until the training data are fit as well as possible and allowing over-fitting to occur.

2. Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.

3. Prune (generalize) each rule by removing any preconditions whose removal improves its estimated accuracy.

4. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.

In rule post-pruning, one rule is generated for each leaf node in the tree. Each attribute test along the path from the root to the leaf becomes a rule antecedent (precondition), and the classification at the leaf node becomes the rule consequent (postcondition). Next, each such rule is pruned by removing any antecedent, or precondition, whose removal does not worsen its estimated accuracy. Rule post-pruning selects whichever pruning step produces the greatest improvement in estimated rule accuracy, and then considers pruning the second precondition as a further pruning step. No pruning step is performed if it reduces the estimated rule accuracy.

The C4.5 algorithm [15] evaluates performance on the training set itself, using a pessimistic estimate to make up for the fact that the training data give an estimate biased in favor of the rules. More specifically, C4.5 calculates its pessimistic estimate by computing the standard deviation of the estimated accuracy assuming a binomial distribution. For a given confidence level, the lower-bound estimate is then taken as the measure of rule performance. For large data sets, this pessimistic estimate is very close to the observed accuracy.

The rules can be generally expressed as follows:

IF (feature 1 = value 1) and (feature 2 = value 2) THEN class concept = class 1.

The major advantages of converting a decision tree to rules are:

* Converting to rules allows distinguishing among the different contexts in which a decision node is used.

* Converting to rules removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves.

* Converting to rules improves readability; rules are often easier for people to understand.

* Rules are easier to incorporate into a knowledge base for further intelligent inference and reasoning.
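Continuing the earlier sketch, the fragment below turns each root-to-leaf path of the dictionary-style tree returned by induce() into an IF-THEN rule and then greedily drops preconditions, in the spirit of rule post-pruning. For simplicity it scores rules by plain accuracy on a held-out set rather than by C4.5's pessimistic binomial estimate, and the data layout matches the hypothetical dictionaries used earlier.

```python
def tree_to_rules(node, path=()):
    """One IF-THEN rule per root-to-leaf path (step 2 of rule post-pruning);
    `node` is the dictionary-style tree produced by induce() above."""
    if not isinstance(node, dict):                      # a leaf: the rule consequent
        return [(list(path), node)]
    rules = []
    for value, child in node["children"].items():
        rules += tree_to_rules(child, path + ((node["attr"], value),))
    return rules

def rule_accuracy(preconds, label, rows, labels):
    """Accuracy of a rule on the examples its preconditions cover."""
    hits = [y == label for r, y in zip(rows, labels)
            if all(r.get(a) == v for a, v in preconds)]
    return sum(hits) / len(hits) if hits else 0.0

def prune_rule(preconds, label, rows, labels):
    """Greedily drop any precondition whose removal does not worsen the
    estimated accuracy (measured here on a validation set, not by C4.5's
    pessimistic binomial lower bound)."""
    improved = True
    while improved and preconds:
        improved = False
        base = rule_accuracy(preconds, label, rows, labels)
        for p in list(preconds):
            trial = [q for q in preconds if q != p]
            if rule_accuracy(trial, label, rows, labels) >= base:
                preconds, improved = trial, True
                break
    return preconds, label
```

The pruned rules would then be sorted by estimated accuracy and applied in that order, as in step 4 above.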

3.3. Video classification procedure for query and filtering

Learned decision trees are constructed with a top-down approach, beginning with the question "Which attribute should be tested at the root of the tree?" The central questions of the video classification algorithm are "Which attribute is the most useful in classifying examples?" and "What is a good quantitative measure of the worth of an attribute?" The goal of the algorithm is to find a value that can measure how well a given attribute separates training examples according to their target classification. The properties described above are very important in video classification because there are so many possible features that can be used as keys to query video/image databases, and no single feature is the best for all queries under different applications. To solve the feature indexing and classification problem efficiently, studying the effectiveness of features for a given classification application is essential.

The rule tree provides the optimal procedure to find a value that measures how well a given attribute separates training examples according to their target classification. A new video clip is then classified as follows. Following the tree, the feature to be utilized in the Level 1 (root level) test is first extracted and the corresponding rule is applied. The result of this first test dictates the next feature selection, extraction and test to be followed, and the same procedure is repeated at each level until a leaf node is reached. In this system, only those features that are relevant are extracted, and they are matched against the rule thresholds directly (a sketch of this traversal is given after the list below). Further processing, such as data indexing, is carried out right after the classification is done.

This system of knowledge-based video classification/inference processing is depicted in Fig. 5. It consists of the following three major steps:

* Preprocessing for feature and feature-extraction selection: Based on the target video specified in a user's profile or query, the system searches for nodes in the rule tree of the knowledge base by traversing up and down the tree structure. If a concept keyword is identified in the video semantic hierarchy tree, the relevant visual/audio features and their thresholds are selected. At the same time, the feature-extraction operators for the corresponding feature descriptors are decided.

* Knowledge-based content matching: Once each feature-extraction operator obtains the feature value, it is compared with the rule. If it matches, then the next feature extraction and matching operators are selected. The procedure continues until a leaf node in the binary tree is reached.

* Video dissemination based on video classification: The raw video data, including audio and texts, are then filtered and disseminated to the end users depending upon the concept matching between the incoming real-time media streams and the users' requests or profiles.

Fig. 5. The flow chart of knowledge-based video classification for query and filtering.
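The listing below sketches this on-line classification step: the rule tree decides which feature-extraction operator to run next, so only the features along the traversed path are ever computed. It reuses the RuleNode tree from the sketch in Section 3.2.1; the extractor functions and scene fields are hypothetical stand-ins for the proxies described in Section 4.

```python
def classify_scene(scene, node, extractors, cache=None):
    """Walk the binary rule tree; a feature is extracted only when the node
    that tests it is actually reached, so irrelevant features are never computed."""
    cache = {} if cache is None else cache
    while not node.is_leaf():
        if node.feature not in cache:                  # lazy, rule-guided extraction
            cache[node.feature] = extractors[node.feature](scene)
        node = node.left if cache[node.feature] <= node.theta else node.right
    return node.label, cache                           # class plus the features used

# Hypothetical extraction operators for a detected scene (placeholders only):
extractors = {
    "motion_magnitude": lambda s: s["avg_motion"],
    "motion_direction": lambda s: s["dominant_dx"],
}
scene = {"avg_motion": 0.7, "dominant_dx": -0.3}
print(classify_scene(scene, tree, extractors))  # uses the RuleNode tree sketched in Section 3.2.1
```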

4. On-line video content extraction and analysis

This section describes the flow chart of video content analysis for on-line video indexing and filtering based on video semantic conceptual content (see Fig. 6). One key technical component integrated into this system is the decomposition of unstructured, continuous video streams into structured units with both perceptual and conceptual meanings; this decomposition comprises three major components.

Figs. 1 and 7 depict schematic analyses of the video structure and of the relationship between video low-level perceptual features and high-level conceptual contents. At first, video data need to be parsed, based either on visual criteria or on logical criteria, from the top down (see Arrow 1 in Fig. 7). Video is analyzed by segmentation into shots. Shots can be defined as sets of contiguous frames which depict the same scene, signify a single camera operation, or contain a distinct event or action such as the significant presence or persistence of an object [17]. Scene changes have to be detected when segmenting the video initially. In the meantime, the low-level perceptual features are extracted for two purposes: one is to serve as content-based indexes to support query and retrieval based on low-level feature similarities; the other is to be used to infer the video's high-level semantic meanings based on the patterns shown in each video concept category. The video segmentation process is followed by shot analysis in order to obtain the final structured video that contains link relations between different shots as well as content features for the different shots. While perceptual features need to be captured bottom up (see Arrow 2 in Fig. 7), the video conceptual meanings tend to be associated with larger blocks of video sequences and should be analyzed top down (see Arrow 3 in Fig. 7). Based on the proposed layered video analysis model and the observations given above, hierarchical multi-level video segmentation and content analysis schemes have been successfully implemented and evaluated in our research work.

Fig. 6. Flow chart of on-line video content analysis for multimedia information access.

Fig. 7. Video feature-extraction and content analysis structure.

We extract low-level video features and knowledge using a bottom-up approach (Arrow 1 in Fig. 7). In this approach, a video is parsed into scenes, and the key frames of the video scenes are extracted to represent and summarize the whole video. Then, the key frame images are analyzed using image processing methods to obtain features such as colors, textures and shapes. Besides these, motion and audio features are also analyzed. These low-level features can be used to index video databases or to link key frames. The main advantage of using low-level features in video database indexing is that database organization can be completely automated. At the same time, its main disadvantage is that it is very difficult to find a description of an image/video that is close to the semantic description of its contents. However, conceptually similar video clips generally share common perceptual patterns, and this provides the foundation of video classification for conceptual meanings. Here, we use a supervised learning method to generate functions that bridge the two.

High-level video meanings are analyzed by classifying video contents with a top-down approach (see Arrow 2 in Fig. 7), as described in previous sections. Since a video may have a wide range of contents, effective classification must be done in a hierarchical way. The advantage of using a semantic description of video is that it is possible to give very detailed and semantically precise descriptions of an image/video, including terms that are unlikely to be determined by using video or image processing techniques alone. The disadvantage of this technique is that as yet there are no efficient methods for fast video/image classification. To overcome this obstacle and to organize the multimedia database more suitably for human use, we propose a rule-based classification system built by supervised learning, together with a hierarchical video conceptual model specified in terms of a video concept tree, of which Fig. 8 shows an example. Furthermore, novel machine learning tools are developed to establish relations between the low-level perceptual features and high-level conceptual video contents. The rule-based knowledge representation is unique but general enough to be used on a variety of problems in different application domains.

Fig. 8. Example of hierarchical concept tree.

4.1. Real-time scene-change detection and key frame extraction

Scene-change detection is very important, since it is the first step and the most fundamental element of video analysis. We have developed a content proxy that implements a scheme for summarizing real-time video in terms of a few representative key frames, extracted from video sequences by the scene-change detection process. The processing scheme is shown in Fig. 9. The main component of this scheme is real-time scene-change detection on RTP intra-H.261 compressed video streams, which are commonly used in MBone broadcasts. A video is split into meaningful scenes such that, for each detected shot, one frame is picked out as a key frame based on a particular criterion. Fig. 9 shows scene-change detection based on histogram comparisons between adjacent frames of a video stream. A video clip can therefore be summarized via key frames extracted from the different scenes that make up the whole stream. We take into account changes in both luminance and chrominance values in making the decision: if the change in the luminance or chrominance histogram over successive frames is larger than the threshold, we categorize that frame as a scene-change frame. As a result, a key frame is generated from the stream and is sent out as a scene on a multicast channel. To avoid unclear images caused by editing effects such as dissolves, we choose the frame that sits at the end of the first 10% of the video sequence right after a new scene has been detected as the key frame of the scene. The scene-change tag and key frame can also be inserted into a multicast channel so that other relevant proxies attempting to do other kinds of processing on the network can utilize this extracted information.

Fig. 9. On-line scene-change detection for H.261 video.
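A simplified sketch of the histogram test and the key-frame rule described above is given below; it operates on decoded YUV frames with NumPy, whereas the actual proxy works on intra-H.261 data in the compressed domain, and the bin count and thresholds are assumptions rather than the paper's values.

```python
import numpy as np

def yuv_histograms(frame, bins=32):
    """Per-channel histograms of a decoded YUV frame (H x W x 3, uint8)."""
    return [np.histogram(frame[..., c], bins=bins, range=(0, 255))[0]
            for c in range(3)]

def is_scene_change(prev, curr, luma_thresh=0.35, chroma_thresh=0.35):
    """Flag a scene change when the luminance OR the chrominance histograms
    of successive frames differ by more than a threshold."""
    diffs = []
    for h1, h2 in zip(yuv_histograms(prev), yuv_histograms(curr)):
        diffs.append(np.abs(h1 - h2).sum() / (2.0 * h1.sum()))   # normalized L1 in [0, 1]
    luma_diff, chroma_diff = diffs[0], max(diffs[1], diffs[2])
    return luma_diff > luma_thresh or chroma_diff > chroma_thresh

def pick_key_frame(scene_frames):
    """Take the frame at the end of the first 10% of the new scene, to avoid
    frames blurred by dissolves and other editing effects."""
    return scene_frames[max(0, len(scene_frames) // 10 - 1)]
```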

including one that utilizes full decompression toget full frames and a partial decompression to getinformation on the changed blocks so as toestimate the extent to which the full frame haschanged since the last frame. In our earlierinvestigations, joint algorithms based on videocodec characteristics were carried out to acquirefast and accurate scene detection. Experimentresults shows that our algorithms are capable ofsupporting real-time video processing and satisfy-ing on-line annotation needs [18,19] given differentnetworks and data characteristics. We also adoptthe scene-change detection algorithm in thecompressed domain proposed by Yeo and Liu[20] for MPEG-1 video. A good survey oftechnologies for parsing digital video was givenby Ahangera and Little [21].

4.2. Video event segmentation with visual and audio cues

Our segmentation process for sports events is shown in Fig. 10. To segment sports video into logical units such as events, we employ a heuristic rule that detects an event boundary by identifying either the sound of a whistle or the change associated with a new scene. Since a whistle from the referee usually indicates the start or end of an event in a game, we treat it as a logical boundary for a new event even if there is no scene change. The audio feature analysis and whistle sound detection are discussed in the following subsections.

Fig. 10. Sports event segmentation flow chart.
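The heuristic can be summarized in a few lines; the sketch below is an illustration rather than the authors' code, and it assumes a per-frame whistle flag from the audio analysis of Section 4.3 and a visual scene-change test such as the one sketched in Section 4.1.

```python
def event_boundaries(frames, whistle_flags, scene_change_fn):
    """Heuristic event segmentation: a new event starts at a detected scene
    change OR at a referee's whistle, even when the scene does not change.
    `whistle_flags[i]` is True when a whistle is heard around frame i (see
    Section 4.3.2); `scene_change_fn(prev, curr)` is the visual detector."""
    boundaries = [0]
    for i in range(1, len(frames)):
        if scene_change_fn(frames[i - 1], frames[i]) or whistle_flags[i]:
            boundaries.append(i)
    return boundaries
```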

4.3. Feature extraction and analysis for audio cues

Generally, audio features are divided into two distinct categories: time domain and frequency domain. To extract features in both domains, we first sample the audio signal at 11,025 Hz with 16 bits/sample. Then, for each audio clip corresponding to a distinct visual scene, we calculate audio features as follows.

4.3.1. Time-domain features

In the time domain, we calculate statistical parameters (such as mean, standard deviation, and dynamic range) of the trajectories of the short-time audio volume and the zero-crossing rate for each audio clip. Non-silence ratios are also calculated based on both the short-time volume and the zero-crossing rate. The short-time audio volume and the zero-crossing rate are defined in Eqs. (4.1) and (4.2), respectively.

$V_n = \sqrt{\frac{1}{N} \sum_m [x(m) \, w(n-m)]^2},$   (4.1)

$Z_n = \frac{1}{2} \sum_m \big| \mathrm{sgn}[x(m)] - \mathrm{sgn}[x(m-1)] \big| \, w(n-m),$   (4.2)

where for both equations

$w(n) = 1$ when $0 \le n \le N-1$, and $w(n) = 0$ otherwise,

and for Eq. (4.2) only

$\mathrm{sgn}[x(m)] = 1$ when $x(m) \ge 0$, while $\mathrm{sgn}[x(m)] = -1$ when $x(m) < 0$.

In both of the above equations, $x(m)$ is the discrete-time audio signal with sample index $m$, and $n$ is the time index of the short audio frame, whose size is specified by a rectangular window $w(n)$ of length $N$. Here, we choose a frame size of $N = 150$ samples (i.e. the audio frame is about 15 ms long) and calculate both features once every 100 samples (about 10 ms apart) in the audio clip. The statistical parameters of the above two features, such as mean and variance, are calculated over the index $n$. Since the dynamic ranges of these statistical features differ a lot, we normalize them by the maximum volume and the maximum zero-crossing rate, respectively, for each audio clip. To detect the silence ratio in each clip, we compare the volume and the zero-crossing rate each with a certain threshold; an audio frame is declared silent when both its volume and its zero-crossing rate are smaller than their respective thresholds. Thus, the non-silence ratios of $V_n$ and $Z_n$ are the percentages of non-silent audio frames over the whole audio clip.
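A sketch of these time-domain measurements is given below, following Eqs. (4.1) and (4.2) with $N = 150$ and a 100-sample hop; the normalization and the relative silence thresholds are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def short_time_features(x, frame_len=150, hop=100):
    """Short-time volume V_n (Eq. (4.1)) and zero-crossing rate Z_n (Eq. (4.2))
    over a rectangular window of N = frame_len samples, evaluated every `hop`
    samples (about 15 ms frames spaced 10 ms apart at 11,025 Hz)."""
    volumes, zcrs = [], []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len].astype(np.float64)
        volumes.append(np.sqrt(np.mean(frame ** 2)))          # V_n
        signs = np.where(frame >= 0, 1, -1)
        zcrs.append(0.5 * np.sum(np.abs(np.diff(signs))))     # Z_n
    return np.array(volumes), np.array(zcrs)

def time_domain_descriptors(x):
    """Per-clip statistics used as time-domain features; the silence thresholds
    (10% of each clip's maxima) are illustrative assumptions."""
    v, z = short_time_features(x)
    v_n, z_n = v / v.max(), z / max(z.max(), 1)   # normalize by per-clip maxima
    silent = (v_n < 0.1) & (z_n < 0.1)            # silent when BOTH measures are low
    return {
        "volume_mean": v_n.mean(), "volume_std": v_n.std(), "volume_range": np.ptp(v_n),
        "zcr_mean": z_n.mean(), "zcr_std": z_n.std(),
        "non_silence_ratio": 1.0 - silent.mean(),
    }
```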

4.3.2. Frequency-domain features

In the frequency domain, we first calculate the spectrum of an audio clip by using a direct FFT with a 512-point FFT size, which generates a 2-D plot of the short-time Fourier transform (over each audio frame) with frequency on the X-axis and amplitude in dB on the Y-axis, for all audio frames over the time domain. We compute the following features for each frame and their distributions over the entire audio clip:

* Short-time fundamental frequency (FuF): The FuF is defined as follows. When the sound is harmonic, the FuF value equals the fundamental frequency estimated from the audio signal; when the sound is non-harmonic, the FuF is set to zero. We calculate each frame's FuF as described by Zhang and Kuo [22].

* FuF distribution for the whole audio clip: Statistical parameters, such as mean and variance, are computed for the trajectory of the per-frame FuF over time through the entire audio clip.

* Centroid frequency and bandwidth: Similar to the work of Wold et al. [23], the frequency centroid $C(i)$ and the bandwidth $B(i)$ of each audio frame are defined as

$C(i) = \dfrac{\int_0^{N} \omega\,|S_i(\omega)|^2\,d\omega}{\int_0^{N} |S_i(\omega)|^2\,d\omega}$,    (4.3)

$B(i) = \sqrt{\dfrac{\int_0^{N} (\omega - C(i))^2\,|S_i(\omega)|^2\,d\omega}{\int_0^{N} |S_i(\omega)|^2\,d\omega}}$,    (4.4)

where $S_i(\omega)$ represents the short-time Fourier transform of the $i$th frame. Using Eqs. (4.3) and (4.4), we compute the centroid and the bandwidth for every audio frame over the entire audio clip, thus generating plots of the centroids and the bandwidths of audio clips along the time axis. The mean and the standard deviation of both the centroid and the bandwidth of an audio clip are used as four frequency-domain features.

* Energy ratio of some sensitive sub-bands: The energy distribution in different frequency bands varies quite significantly among different audio signals. For example, the spectral peak tracks in speech normally lie in the lower frequency bands, ranging from 100 to 300 Hz, while whistles, which are often heard in sports videos, have high frequencies and strong spectral energy, with frequencies ranging from 3500 to


4500 Hz. To differentiate special audio events, such as speech, whistles or noise, we calculate energy ratios in the sub-bands [0–400 Hz], [400–1720 Hz], [1800–3500 Hz] and [3500–4500 Hz] with respect to the overall energy for all frames in the audio clip.

* Peak of the spectrum on each audio frame and peak distribution over the entire audio clip: The peak track in the spectrum of an audio signal often provides some characteristic property of the sound. In sports video, one typical sound is the referee's whistle, which often occurs right after fouls in basketball and soccer, or at the beginning of a serve in volleyball. Whistles in sports videos often last at least 1 s and have stronger energy than speech and music. We link whistle detection to both video semantic boundary detection and semantic meaning inference. Peaks of whistle spectra normally range from 3500 to 4500 Hz. We detect the most prominent frequency in the FFT spectrum of every frame in an audio clip so as to obtain a 2-D graph with the time domain on the X-axis. A whistle sound is declared detected if there is a longer-than-1 s window of peak frequencies that fall into the range between 3500 and 4500 Hz, as shown in Fig. 11. A sketch of these frequency-domain computations is given after this list.
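The following hedged sketch (our own, not the authors' code) computes the frequency centroid and bandwidth of Eqs. (4.3) and (4.4), the sub-band energy ratios and the per-frame peak frequency with a 512-point FFT, and applies the longer-than-1 s peak-frequency rule for whistle detection; all function names and the `hop_seconds` parameter are assumptions for illustration.

```python
import numpy as np

def frequency_features(frame, n_fft=512, fs=11025):
    """Frequency centroid C(i) (Eq. 4.3), bandwidth B(i) (Eq. 4.4),
    sub-band energy ratios and the peak frequency of one audio frame."""
    spec = np.abs(np.fft.rfft(frame.astype(np.float64), n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    power = spec ** 2
    total = power.sum() + 1e-12
    centroid = (freqs * power).sum() / total
    bandwidth = np.sqrt(((freqs - centroid) ** 2 * power).sum() / total)
    # Sub-bands used in the paper for speech / whistle / noise discrimination.
    bands = [(0, 400), (400, 1720), (1800, 3500), (3500, 4500)]
    ratios = [power[(freqs >= lo) & (freqs < hi)].sum() / total for lo, hi in bands]
    peak_freq = freqs[np.argmax(power)]
    return centroid, bandwidth, ratios, peak_freq

def detect_whistle(peak_freqs, hop_seconds, lo=3500.0, hi=4500.0, min_dur=1.0):
    """True if the per-frame peak frequency stays inside [lo, hi] Hz
    for more than `min_dur` seconds, as in the whistle rule of Fig. 11."""
    run = 0
    for f in peak_freqs:
        run = run + 1 if lo <= f <= hi else 0
        if run * hop_seconds > min_dur:
            return True
    return False
```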

4.4. Motion feature extraction and analysis

Motion information is a good cue to use in video [24], as it is an integral part of a motion sequence. In addition, motion is typically already calculated in most video codecs, and motion-compensation information is available in the compressed data stream. In MPEG-1 video, there are three types of frames: I frames, B frames and P frames (see Fig. 12). To find the motion patterns of certain videos, we focus on the direction and magnitude of the motion flows of P frames only. The reason is that P frames give only forward prediction, and this information is useful for calculating the direction of the ''flow'' of a video clip. Motion information is especially important to us since we focus on sports videos, where the ''motion flow'' is a significant cue. We did not track objects and extract

object motion information directly, as this approach is typically computationally complex and time consuming. Instead, we tried statistical motion descriptors to see if they could satisfy our requirements. We calculated two such statistical features: the dominant motion direction and the motion magnitude of the motion vectors in a clip.

Fig. 11. Whistle sound detection by using audio spectrum features.


4.4.1.1. Direction of motion descriptor extraction. For each of the motion vectors in the vector image, we can cluster and then classify the motion vector's direction according to Fig. 13. We first cluster the vectors as in Fig. 13(b); the criteria are as follows: the vectors in Regions 1 and 8 are termed RIGHT; the vectors in Regions 2 and 3, UP; the vectors in Regions 4 and 5, LEFT; and the vectors in Regions 6 and 7, DOWN. We then calculate the amount of motion along each direction by counting the total number of vectors of each class for the whole video sequence. This results in a 4-D motion direction vector.

4.4.1.2. Motion magnitude calculation. Normally, the magnitude of the motion is also encoded in the motion vector. To get instances of the magnitude and speed of the motion descriptors along both the X and Y directions, we calculate the motion magnitude of the whole frame according to Eq. (4.5):

$x_{\mathrm{ave}} = \dfrac{\sum_{i=0}^{n-1} x_i}{n}, \qquad y_{\mathrm{ave}} = \dfrac{\sum_{j=0}^{m-1} y_j}{m}$,    (4.5)

where $n$ and $m$ are the total numbers of motion vectors in the frame with respect to the X and Y directions, respectively. We also calculated the average and the largest motion magnitudes along the X and Y directions for the whole video sequence.

4.4.2. Color features

In addition to motion, color and edge information also play an important role in object identification. In particular, we wanted to use color and edge information to classify a given video clip's key frame into four categories, such as left court, right court, middle court and so on, as shown in Figs. 14–16. To achieve this goal, a key frame was first

extracted for every newly detected scene. To avoid

Fig. 12. Forward prediction and bi-directional prediction for MPEG-1 video.

Fig. 13. Motion directions extraction.


unclear images caused by editing effects, such as dissolves, we chose as the key frame the frame that marks the end of the first 10% of the video sequence after a new scene has been detected by the histogram-based scene-change detection algorithm. We then extracted color information such as the color histogram, dominant colors and regional color information for each key frame. YUV histograms are automatically generated by the shot-change

detection agents. The color histogram and the dominant color orientation histogram [25] are statistical visual cues; they do not contain spatial information. For example, some visually different key frames might have exactly the same color distribution but different localized colors. Thus, to get more detailed color information and increase the differentiating power of the color feature, we considered the dominant colors and the localized color information. We used the median-cut algorithm [26] to reduce the color map to about 256 colors; that is, colors in an image were mapped to their closest match in the new color map so that the colors of the original image were clustered. We used this clustering method to automatically detect a number of dominant colors and output them as a color tree for each frame. Then, all pixels whose distance to a dominant color was not bigger than a given threshold were mapped back into homogeneous regions, to obtain the regions corresponding to the first five maximum dominant colors. Once we got the region for each dominant color, the centroid and the region boundary were calculated as the attribute values of the regional color.
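As a rough sketch of this step (substituting a simple uniform color quantization for the median-cut algorithm of [26], and omitting the region boundary), the following fragment finds dominant colors and the centroids of their back-mapped regions; all names and thresholds are illustrative assumptions.

```python
import numpy as np

def dominant_color_regions(frame_rgb, n_colors=5, dist_thresh=40.0):
    """Quantize to a coarse RGB grid, keep the `n_colors` most frequent bins,
    then back-map pixels within `dist_thresh` of a dominant color and report
    each region's representative color, centroid and area."""
    h, w, _ = frame_rgb.shape
    pixels = frame_rgb.reshape(-1, 3).astype(np.float64)
    quantized = (pixels // 32).astype(np.int32)             # 8 levels per channel
    keys = quantized[:, 0] * 64 + quantized[:, 1] * 8 + quantized[:, 2]
    bins, counts = np.unique(keys, return_counts=True)
    dominant_bins = bins[np.argsort(counts)[::-1][:n_colors]]

    regions = []
    ys, xs = np.mgrid[0:h, 0:w]
    for b in dominant_bins:
        # Representative color = mean of the pixels that fell into this bin.
        color = pixels[keys == b].mean(axis=0)
        mask = (np.linalg.norm(pixels - color, axis=1) <= dist_thresh).reshape(h, w)
        if mask.any():
            regions.append({"color": color,
                            "centroid": (float(ys[mask].mean()), float(xs[mask].mean())),
                            "area": int(mask.sum())})
    return regions
```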

Fig. 15. Features to cluster key frames into: (a) left court, (b) right court, (c) middle court, and (d) close-up.

Fig. 14. Various key frames of basketball scenes: (a) left court, (b) right court, (c) middle court, and (d) close-ups.


4.4.3. Edge detection and analysis

Changes or discontinuities in an image amplitude attribute, such as the luminance or the tristimulus value, are fundamental characteristics of an image, since they often indicate the physical extent of objects within the image. Edges characterize object boundaries and can be used in image segmentation, registration, and the identification and representation of objects in scenes. Edge patterns can also be used to analyze key characteristics for scene classification; for example, boxing videos generally contain fence-like edges with more horizontal edge lines than soccer videos. Edge patterns can also be treated as a simple texture in each key frame image.

Here, we used a gradient edge operator and applied edge detection masks to fulfil the edge detection task. The edge detection masks included first-order derivative masks, such as the Prewitt and the Robinson three-level masks; edge detection with an optional second-order Laplacian mask was also provided. With the first-order derivative masks, all eight directional masks were used to detect edges along different directions. Edge densities along different directions and different lengths around the dominant color region were calculated as edge features. Furthermore, we analyzed the edge information along four directions by calculating the distribution of the visible edges and clustering them into horizontal, vertical and other edge types. In the meantime, each key frame of the video clip was clustered into the categories shown in Figs. 14 and 15. The flow chart for key frame clustering is shown in Fig. 16. In the experiment, the left court has a regional blue color in the middle of the left half region, and the edges around the region are horizontal and vertical. By contrast, while the middle field does not have a blue dominant color, it has a vertical edge in the middle.

4.5. Video key frame clustering

Once all low-level visual proxies have generated visual feature values, clustering proxies can group similar low-level features to identify key frames. For the basketball video, we used the color and edge information to classify a key frame into four categories, namely the left court, the right court, the middle court and others, as shown in Figs. 14–16. In the experiment, the left court has a regional green color in the middle of the left half region, and the edges around the region are horizontal and vertical. The middle field does not have the green dominant color; instead, it has a vertical edge in the middle.
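Combining the two sketches above, a rule-of-thumb clustering of a key frame into these categories could look like the fragment below. The court color, density thresholds and court-position cut-offs are illustrative constants only; in the paper such rules come from the trained knowledge base, not from fixed values.

```python
import numpy as np

def classify_key_frame(regions, edge_feats, frame_width,
                       court_color=(0, 130, 90), color_dist=60.0):
    """Illustrative key-frame clustering into left court / right court /
    middle court / other, from dominant-color regions and edge densities."""
    court_regions = [r for r in regions
                     if np.linalg.norm(np.array(r["color"]) - np.array(court_color)) < color_dist]
    if not court_regions:
        return "other"            # e.g. close-ups of players or the audience
    # Horizontal position (x) of the largest court-colored region's centroid.
    cx = max(court_regions, key=lambda r: r["area"])["centroid"][1]
    has_edges = (edge_feats["horizontal_density"] > 0.02
                 and edge_feats["vertical_density"] > 0.02)
    if cx < frame_width / 3 and has_edges:
        return "left_court"
    if cx > 2 * frame_width / 3 and has_edges:
        return "right_court"
    if edge_feats["vertical_density"] > 0.02:
        return "middle_court"
    return "other"
```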

5. Evaluation of the proposed system

5.1. System architecture

According to our layered video analysis model, we proposed and implemented a system prototype for a content-aware and user-preference-oriented multimedia data distribution system over the Internet based on the multicast protocol. Fig. 17 shows an infrastructure prototype that supports on-line media content analysis and presents a model of service that realizes the goal of providing effective on-line content-based media dissemination by filtering. More specifically, we developed a real-time intelligent multimedia system prototype consisting

Fig. 16. Feature extraction for key frame clustering.


of coordinated proxies for fast video content analysis and dissemination over the Internet based on users' profiles. One of its applications is to manage the contents of real-time collaborative sessions, which generate various types of data streams, primarily raw audio, video and graphics data, along with application-specific data types. As part of its operation, this system aims to analyze input streams based on users' profiles, enrich the raw data streams with semantic description tags, and create knowledge rules to capture high-level conceptual meanings. It then uses the created knowledge-based system to support filtering and ad hoc queries of these data streams. However, many multimedia data streams in their raw forms are not amenable to automated semantic interpretation, and typically have to be enhanced with other features, which are either manually created/attached or are extracted by analyzing the raw data in off-line mode. Our system provides a set of intelligent tools to solve this challenging task.

As shown in Fig. 17, the system provides on-

line feature extraction, multimedia stream content understanding and organization, and data filtering by matching with user profiles for real-time media distribution and sharing over the Internet. This system is expected to provide synchronized multimedia data stream distribution and filtering. In addition, the system will attempt to organize multimedia resources over the Internet in a

scalable way, allowing users to find items related to their interests based on the content of the data. The system can also be extended to applications such as interactive and personalized TV broadcast services, personalized web services, training services and collaborative applications. The whole system is composed of the following three modules.

1. Content agents/proxies: To meet the goal of fast on-line multimedia information access, every stream must be transmitted in real time or near real time, and quick content analysis and annotation are required so that streams can be properly classified and disseminated by the architecture. Most of the functionality for content extraction, content analysis and data filtering/redistribution as per user interests/queries is fulfilled by intelligent content agents residing at proper Internet nodes. In this architecture, the agent or proxy is an active software module that can be placed throughout the network grid to perform various operations needed for on-line multimedia data processing. Typically, the order of execution for a set of proxies depends on a particular request. For example, after an annotation proxy generates annotations of a stream, a filtering proxy will perform matching functions based on the learned knowledge and users' profiles, to properly re-route or cut off the stream as necessary, and a transcoding

Fig. 17. The proposed on-line video stream analysis and dissemination infrastructure.


proxy (and/or a summarization proxy) will transform the raw data to adapt to low-bandwidth networks. Note that it is possible to combine some of these operations into a single proxy; for example, the same proxy can perform both annotation and filtering. Furthermore, annotations extracted by content agents can also support indexing and content-based retrieval.

In the on-line video content analysis infra-

structure shown in Fig. 17, each semantic multicast content proxy consists of modules that perform one or more specific data-processing operations, and each runs as a daemon. The video content analysis proxies are centered on the hierarchical decomposition of video data, and extract visual/audio/motional characteristic contents by combining and/or coordinating video semantic class inference engines. High-level video semantics are inferred from low-level features for the filtering purpose. Extracted features and semantic content can further serve as annotations or indexes for off-line database management. Some of the functions, performed by different proxies, are described below.
* Annotation: The creation of a high-level

annotation tag for an information stream is an important form of content enrichment and is essential to effective information dissemination in semantic multicast. An agent may generate tags on session sub-streams (e.g. timestamp, or concept) to prepare for archival and filtering. In their raw form, multimedia data types such as video and audio are not amenable to automated semantic interpretation and typically have to be enhanced with higher-level features such as keywords, video scene-change tags, and representative sample frames. For example, an audio classifier can classify audio signals into categories such as speech, noise or whistle. Speech-understanding systems can automatically transcribe the audio stream, allowing the creation of a time-aligned transcript of the spoken words contained in the audio

stream. At the next level, natural language processing techniques can be applied to correct and summarize the transcript, as well as to identify key words that describe logical sub-units of the entire session, as defined by the video segmentation operation.

* Filtering: An agent may subset a session based on annotations to reduce its scope to the interests of a particular group. Such filtering is generally time constrained to minimize the latency incurred in the delivery of filtered information to users. In our filtering algorithms, we employ a knowledge base which accommodates the media low-level feature descriptors, plus description schemes to facilitate the filtering. Each feature descriptor has its own specific definition and extraction operator.

* Archival: An agent may store ''appropriate'' sub-sessions in an associated multimedia archive. As the agent archives the stream, it performs a more detailed, off-line analysis to provide additional semantic structuring and indexing for subsequent retrieval and feedback to the semantic multicast graph.

* Temporal synchronization of content with descriptions: To allow the temporal association of descriptions with content (AV objects) that can vary over time, and to support effective media stream consumption, we use the timestamps of RTP media streams as a synchronization connection between the various media streams.

* Synthesis of multiple low-level features associated with a content item: An agent allows flexible localization of descriptor data with one or more content objects. A variety of descriptors and description schemes could be associated with each content item; depending on the user's profile, not all of these will be used in all cases. In push applications (e.g. real-time multicast/broadcast), effective feature synthesis and data multiplexing are needed to satisfy various content requests from users.


* Buffer management for further decisions and actions: Local storage and buffering management is essential for real-time applications, as every step of content analysis or decision making generates latency.

2. Video semantic inference engine: As we discussed in previous sections, low-level contents have limits in their ability to support more advanced information access demands, such as on-line filtering based on semantic contents. To overcome this limitation, agents may archive information streams and perform more detailed analysis on the data to provide additional semantic structures for subsequent filtering and retrieval. The video semantic inference engine consists of the rule-based knowledge base for the video classification subsystem. A service assigner manages and coordinates the content agents for on-line perceptual and semantic content analysis, as well as for the filtering and query procedures.

3. Multimedia repository: A multimedia repository stores the data streams and any annotations and transformations made by the content agents, such as the feature extraction proxies. As we discussed, video content is analyzed in both its perceptual and conceptual aspects, and multimedia streams, especially video, can be organized and stored with indexes pointing into the continuous streams with both perceptual and conceptual features. A smart interface to the multimedia repository provides the tools for off-line data searching, retrieval and browsing.

5.2. On-line basketball video semantic classification for filtering and indexing

5.2.1. Basketball video classification rules and experimental results

The proposed knowledge- and rule-based video classification system is shown in Figs. 3 and 5. Knowledge for video classification is first trained off-line. In our experiment, sample video clips of the different categories were first identified and appropriate low-level features were extracted. We then utilized an entropy-based inductive tree-learning algorithm [16] to establish the trained knowledge base for the video data type. This knowledge base is represented as a decision tree, with each node in the tree being an if–then rule applied to a similarity metric that utilizes an ''appropriate'' low-level feature along with a good ''derived threshold''. The rule scheme for basketball is shown in Fig. 18, where the rule at each level is depicted as $\langle F, \theta \rangle$. Note that the appropriate feature $F$ and a good threshold $\theta$ are automatically created by the training process. Note also that the semantic categories into which

Fig. 18. Rule tree for basketball video classification.


the video sequences will be classified form the leaves of the tree. This rule-based classifier begins with the question ''Which attribute should be tested at the root of the tree?'', and we aim for the attributes that are most useful for classifying the examples. Next we may ask ''What is a good quantitative measure of the worth of an attribute?'', and the tree-learning procedure provides a way to find a value that measures how well a given attribute separates the training examples according to their target classification. A new video clip is then classified as follows: following the tree, the feature utilized at Level 1 (the root level) is first extracted and the corresponding rule is applied, after which the selected branch is followed. At the next level, the same step is carried out, whereby an appropriate feature is selected and the corresponding rule applied. In this system, only the relevant features are extracted, and they are matched directly against the rule thresholds. Further processing, such as data indexing, is performed right after the classification is done.

We applied our system (Figs. 3 and 18) to

basketball videos by on-line classifying and filtering basketball into nine major meaningful events. They are:

1. Team offense at the left court;
2. Team offense at the right court;
3. Fastbreak to the left;
4. Fastbreak to the right;
5. Dunk-like event in the left court;
6. Dunk-like event in the right court;
7. Scoring in the left court;
8. Scoring in the right court; and
9. Close-ups of the audience or players.

We applied the proposed system to 157 basketball video clips segmented from a basketball game for training. After training, we arrived at an at-most-three-level decision tree that contains 14 rules, as shown in Fig. 18. Note that in the classification stage, at most three rule evaluations are needed for each class, since that is the depth of the tree. No more than six features are needed to classify all nine basketball events. We used a set of basketball video data from one game to train the learning algorithm to get the nine classes' critical patterns and classifying rules that have differentiating

power. From the rule tree, we see that using the descriptor of key frame type alone, the program can judge whether the video sequence is a close-up or not. To discern a right or left fastbreak, we first need to judge the key-frame type, then the dominant motion direction, and then the average magnitude of the motion-vector component along the X-axis. In other words, only the key frame type, the motion direction and the average magnitude along the X-axis are relevant for the right and left fastbreak classes, and thus these features, together with their learned thresholds, characterize the fastbreak events, which is especially useful for on-line user-profile filtering.

By applying the learned rules to classify a

new set of 110 basketball game video clips, we reached classification accuracies from 70% to 85.7% for the nine identified basketball events, as shown in Table 1. Here, accuracy is defined as

$\mathrm{Accuracy} = \dfrac{\#\,\mathrm{correct}}{\#\,\mathrm{correct} + \#\,\mathrm{false\ alarm}}.$

5.2.2. Content agent coordination for filtering and

indexing

The proposed rule-based video classification system is suitable for both on-line and off-line video classification, which is applicable to video indexing systems, video scene understanding and mining, on-line video filtering, intelligent video summarization, fast video browsing and so

Table 1
Results of basketball video classification by the rule-based classification system

Class             Training sample   Testing sample   Accuracy (%)
Left offense      20                14               75.8
Right offense     22                14               85.7
Left fastbreak    20                14               78.5
Right fastbreak   21                15               80.0
Left scores       15                12               75.0
Right scores      17                10               70.0
Left dunk         12                10               70.0
Right dunk        10                9                77.8
Close-up scenes   20                12               75.0
Total             157               110              78.2


on. General video classification problems can easily follow the system prototype illustrated in Figs. 3 and 5. Once we have learned the knowledge to classify each basketball event, the rules are used to drive on-line feature extraction for both archiving and user-specified filtering purposes.

It is straightforward to apply such a rule-based

video classification system to on-line user-specified video filtering. Fig. 19 illustrates the data flow for such applications. For any user-specified video category, the knowledge base contains the corresponding characteristic features and the rules to identify it. Only the relevant features are extracted on-line, and they are matched with the rule thresholds. If they satisfy the rules, then the real-time stream matches the user's expectation; otherwise, it does not, and a further decision based on this intelligent video classification is made according to the application. For example, to classify left-scoring events, we first need to segment the video into small units, and then extract key frames

and low-level features for clustering. It takes multiple cooperating agents to realize the final results for video semantic classification, indexing and dissemination, as shown in Fig. 20.

In the current system design, the negotiation of

agent services is realized by requiring each proxy to implement an ''applicability'' function that captures the behavior and capability of the agent and exposes it to the semantic multicast framework. However, an agent often can process the input data streams but not necessarily transform them into a form that satisfies the target group request. In this case, while the agent cannot by itself completely service the data-processing needs, it may still ''partially'' transform the input data streams into others that can be further processed by other agents to satisfy the target group request. As a result, the applicability function is defined to return: (i) an intermediate group request that the agent can process the input data streams into, (ii) the specification of the intermediate data streams, and (iii) the configuration parameters for the agent to perform the operation. The intermediate group request and data streams can then be treated as a new source request and data stream and, along with the target group request, get passed to other agents to determine if they can complete the mapping from the source to the target group request through a series of agent instances.

The characteristic features include low-level

features of video, such as key frames, color histograms, dominant colors and regional colors. For any incoming on-line key frame, video feature-extraction agents quickly extract low-level features, using the same feature descriptors and algorithms as those specified in the knowledge base. A key frame classification proxy then checks whether the new

Fig. 19. Flow chart for on-line video filtering based on the user's profile.

Fig. 20. Data flows in content analysis specified by rules: agents cooperating to realize video semantic classification for indexing and dissemination.


key frame’s features are matched with those in theknowledge base and a binary decision (yes/no) foreach key frame semantic category will be output.This is of great use in the sense that once we canmake this distinction, we can (in the case of abasketball-interested user, say) either pass thevideo to the user if the user requires basketballvideo, or we may use the key frame of the basketball video as the basket ball event boundary forfiner semantics extraction from the video classifi-cation procedure, if his or her request is morespecific.Fig. 20 shows how the data streams carrying

‘‘all basketball video’’ may be processed by a seriesof semantic multicase agents to produce datastreams carrying ‘‘Left-scoring-related basketballevents’’ that satisfy the request represented by thecorresponding semantic multicase graph node. Inparticular, a video key frame extraction agent mayidentify natural break points in the newscast videoand a video segmentation agent may use that

information to separate the newscast into segments representing independent news stories. Furthermore, after a video newscast has been segmented, the closed-caption text associated with each news story segment may be analyzed and processed by a concept-filtering agent to identify news stories related to left-scoring basketball events for redistribution.

As specified by the on-line video annotation and

classification procedure shown in Fig. 5, appropriate agents choose appropriate algorithms to obtain the target feature descriptors (see Figs. 17 and 20), and output a binary classification result (yes/no) with respect to the user's specification, based on the rules specified for each basketball event. After on-line video segmentation and automatic low-level feature annotation and classification, basketball videos can be parsed and annotated as shown in Fig. 21. The annotated metadata can be further stored in a relational database, as shown in Table 2, and serve as indexes into the continuous basketball video games.
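As a minimal sketch of such an annotation index, modeled on the columns of Table 2 (the values and the SQLite storage choice are illustrative assumptions, not part of the paper's implementation):

```python
import sqlite3

# Annotation index in the spirit of Table 2; one row per classified event.
conn = sqlite3.connect("basketball_index.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS video_annotation (
        VideoID    TEXT,
        SeqID      TEXT,
        StartTime  REAL,
        EndTime    REAL,
        Events     TEXT,
        CourtType  TEXT,
        MotionDV   TEXT,   -- serialized (m1, m2) motion descriptor
        FeatureN   TEXT    -- serialized low-level feature vector (f1, ..., fn)
    )
""")
conn.execute(
    "INSERT INTO video_annotation VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("V1", "S1", 12.4, 19.8, "left scoring", "Left", "(3.1, 0.7)", "(0.42, 0.11, 0.93)"),
)
conn.commit()
```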

Table 2
Video annotation indexing table based on both perceptual and semantic content

VideoID   SeqID   Start time   End time   Events         Court type   MotionDV   FeatureN
V1        S1      T1           T2         Left scoring   Left         (m1, m2)   (f1, f2, ..., fn)
—         —       —            —          —              —            —          —

Fig. 21. Basketball indexed with event concepts.


These annotated indexes are based on both the low-level perceptual features and the high-level semantic contents as specified by the basketball event categories.

The video archives of on-line live video streams

can be queried based on both semantic content and low-level feature similarity matching. For example, to search for basketball left-scoring events, we can use query execution plans like Eqs. (5.1) and (5.2):

SELECT V, S
FROM VideoID V, SeqID S, Events E
WHERE Events = 'left scoring'    (5.1)

Or, from Fig. 18, we can use

SELECT V, S
FROM VideoID V, SeqID S, CourtType C, MotionDV M, FeatureN Fn
WHERE Court = 'Left' AND M1 > $\theta_m$ AND Fn $\le$ $\theta_n$    (5.2)

In Eq. (5.2), $\theta_m$ and $\theta_n$ are the low-level feature-value thresholds for each test feature used in the knowledge base. By applying queries such as Eqs. (5.1) and (5.2), fast on-line and off-line video information access is enabled.
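Continuing the SQLite sketch above, the semantic query of Eq. (5.1) could be issued as follows; the threshold values are placeholders standing in for values learned in the rule tree.

```python
# Eq. (5.1)-style semantic query against the illustrative annotation table.
left_scoring = conn.execute(
    "SELECT VideoID, SeqID FROM video_annotation WHERE Events = ?",
    ("left scoring",),
).fetchall()

# For Eq. (5.2)-style feature-threshold queries, the motion and feature components
# would need to be stored as separate numeric columns (e.g. M1, Fn) rather than as
# the serialized strings used in the sketch above.
theta_m, theta_n = 2.0, 0.5   # placeholder thresholds from the learned rule tree
```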

6. Conclusion

In this paper, we have introduced a novel system approach to on-line knowledge- and rule-based video classification: one that supports automatic indexing and filtering based on the semantic concept hierarchy. Our research addresses not only the challenges arising from general video management issues (either on-line or off-line), such as video semantic content analysis, but also the stringent requirements imposed by on-line video processing. The difference between our work and existing accomplishments in the literature is that, while most of them use static models for video classification to provide semantic indexing of off-line multimedia databases, we use supervised learning techniques to form an on-line classification system and apply it specifically to basketball video event indexing as an experimental example. A complete

prototype system has been developed to facilitate intelligent access to rich multimedia data over the Internet, and to evaluate the performance of clustering and of video/audio content analysis and feature-extraction techniques using sports data.

At the core of our system, we have developed a

general video analysis model in conjunction with various techniques for fast and efficient video content analysis, such as scene-change detection, key frame selection, low-level feature extraction and clustering, and video semantic classification. A supervised rule-based video classification system is proposed using automatic video segmentation, annotation and summarization techniques for seamless information browsing and updating. The rules were calculated using an inductive decision-tree-learning approach applied to multiple low-level image features. The proposed rule-based video classification system works both on-line and off-line, and is useful for numerous video applications. In particular, the classification system was applied to basketball clips with good accuracy, which shows that this system is effective and promising. Because the learning algorithm and low-level features are general, the proposed system is also suitable for other video domains if the appropriate new rules for the specific video type are learned in advance. Ultimately, for videos from different domains, a more complete set of video features may be extracted for the training process. One of the major architectural advantages of our on-line multimedia content analysis system is the use of intelligent proxies to encapsulate, coordinate and combine distinct operations in optimum processing orders for semantic analysis of video and audio content. Moreover, the system prototype can be made modular and scalable. In the future, we will extend our work to wider video domains, such as other sports like football or soccer.

Acknowledgements

The work presented in this paper is supported by the DARPA project on Semantic Multicast, No. N66001-97-2-8901.


References

[1] S. Dao, E. Shek, A. Vellaikal, R. Muntz, L. Zhang, M. Potkonjak, Semantic multicast: intelligently sharing collaborative sessions, J. ACM Comput. Surveys (1999).
[2] S.F. Chang, W. Chen, H.J. Meng, H. Sundaram, D. Zhong, VideoQ: an automated content based video search system using visual cues, ACM Multimedia (1997).
[3] Y. Deng, B.S. Manjunath, Content-based search of video using color, texture and motion, Proc. IEEE Int. Conf. Imaging Process. 2 (1997) 534–537.
[4] Y. Deng, B.S. Manjunath, Spatio-temporal relationships and video object extraction, Proceedings of the 32nd Asilomar Conference on Signals, Systems and Computers, November 1998.
[5] N. Dimitrova, F. Golshani, Motion recovery for video content classification, ACM Trans. Inform. Syst. 13 (4) (1995) 408–439.
[6] G. Iyengar, A. Lippman, Models for automatic classification of video sequences, Proceedings of the SPIE Multimedia Storage and Archiving Systems, Vol. 3312, San Jose, CA, 1998, pp. 216–227.
[7] A. Jaimes, S.F. Chang, Model-based classification of visual information for content-based retrieval, Storage and Retrieval for Image and Video Databases VII, IS&T/SPIE99, San Jose, January 1999.
[8] R.W. Picard, A society of models for video and image libraries, IBM Syst. J. 35 (1996) 292–312.
[9] T.P. Minka, R.W. Picard, Interactive learning with a society of models, Pattern Recognition 30 (4) (1997) 565–581.
[10] D.D. Saur, Y.P. Tan, S.R. Kulkarni, P.J. Ramadge, Automated analysis and annotation of basketball video, SPIE 3022 (1997).
[11] Y. Gong, L.T. Sin, C.H. Chuan, H. Zhang, M. Sakauchi, Automatic parsing of TV soccer programs, IEEE Trans. (1995) 167–172.
[12] G. Sudhir, J.C.M. Lee, A.K. Jain, Automatic classification of tennis video for high-level content-based retrieval, IEEE Multimedia (1997).
[13] D. Crevier, R. Lepage, Knowledge-based image understanding systems: a survey, Computer Vision and Image Understanding 67 (1997) 161–185.
[14] A.M. Nazif, M.D. Levine, Low-level image segmentation: an expert system, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 555–577.
[15] R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[16] T. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[17] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, P. Yanker, Query by image and video content: the QBIC system, IEEE Comput. 28 (9) (1995) 23–32.
[18] W. Zhou, Y. Shen, A. Vellaikal, C.-C. Kuo, On-line scene change detection of multicast (MBone) video, Proc. SPIE (1998).
[19] W. Zhou, A. Vellaikal, Y. Shen, C.-C. Kuo, Real-time content-based processing of multicast video over the internet, Proceedings of the 32nd Asilomar Conference on Signals, Systems and Computers, November 1998.
[20] B.-L. Yeo, B. Liu, Rapid scene analysis on compressed video, IEEE Trans. Circuits Syst. Video Technol. 5 (1995) 533–544.
[21] G. Ahanger, T.D.C. Little, A survey of technologies for parsing and indexing digital video, J. Visual Commun. Image Representation 7 (1) (1996) 28–43.
[22] T. Zhang, C.-C. Jay Kuo, Content-based classification and retrieval of audio, SPIE's 43rd Annual Meeting, Conference on Advanced Signal Processing Algorithms, Architectures, and Implementations VII, SPIE Vol. 3461, San Diego, July 1998, pp. 432–443.
[23] E. Wold, T. Blum, D. Keislar, J. Wheaten, Content-based classification, search, and retrieval of audio, IEEE Multimedia 3 (3) (1996) 27–36.
[24] N. Dimitrova, F. Golshani, Motion recovery for video content classification, ACM Trans. Inform. Syst. 13 (4) (1995) 408–439.
[25] S. Sclaroff, L. Taycher, M.L. Cascia, ImageRover: a content-based image browser for the World Wide Web, Proceedings of the IEEE Workshop on Content-based Access of Image and Video Libraries, June 1997.
[26] P. Heckbert, Color image quantization for frame buffer display, SIGGRAPH Proc. (1982) 297.
