
Semantic Retrieval of Video

Ziyou Xiong¹, Xiang Zhou², Qi Tian³, Yong Rui⁴, and Thomas S. Huang⁵

Abstract: In this article we review research on three types of video content: video of meetings, movies and broadcast news, and sports video. We place this work in a general framework of video summarization, browsing, and retrieval, and review video representation techniques for these three types of content within this framework. Finally, we present the challenges facing the video retrieval research community.

1. INTRODUCTION

Video content can be accessed by using either a top-down approach or a bottom-up approach [1, 2, 3, 4]. The top-down approach, i.e., video browsing, is useful when we need to get an "essence" of the content. The bottom-up approach, i.e., video retrieval, is useful when we know exactly what we are looking for in the content, as shown in Fig. 1. When we do not know exactly what we are looking for in the content, human-computer interaction, such as relevance feedback and active learning, can be used to better match the intentions and needs of the user with the video content.

Figure 1. Relationship between video retrieval and browsing

In the following, we give an overview of the research work on three major types of video content: video of meetings, movies and broadcast news, and sports.

1.1 Video of Meetings

Meetings are an important part of everyday life for many workgroups. Often, due to scheduling conflicts or travel constraints, people cannot attend all of their scheduled meetings. In addition, people are often only peripherally interested in a meeting: they want to know what happened during the meeting without actually attending. Being able to browse and skim these types of meetings could be quite valuable. Initial work on summarization and retrieval of meeting video has been reported in [33].

1.2 Movies and Broadcast News

Recently, movies and news videos have received great attention from the research community, motivated largely by the interest of movie makers and broadcasters in building large digital archives of their assets, for reuse of archive materials in TV programs or for on-line availability to other companies [5]. Movies and news have a rather definite structure and do not offer a wide variety of edit effects (mainly cuts) or shooting conditions (e.g., illumination). This definite structure is suitable for content analysis and has been exploited for automatic classification, for example in [6], [7], [8], [9], [10], [11]. In all of these systems a two-stage scene classification scheme is employed. First, the video stream is parsed and video shots are extracted. Each shot is then classified according to content classes such as news report, weather forecast, etc.

1 United Technologies Research Center, East Hartford, CT, Email: [email protected]
2 Siemens Corporate Research, Princeton, NJ 08540, Email: [email protected]
3 Department of Computer Science, University of Texas at San Antonio, San Antonio, TX, Email: [email protected]
4 Microsoft Research, One Microsoft Way, Redmond, WA, E-mail: [email protected]
5 Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, IL, E-mail: [email protected]


The general approach to this type of classification relies on the definition of one or more image templates for each content class. To classify a generic shot, a key frame is extracted and matched against the image template of every content class. Other works deal with the problem of video indexing using information sources such as the text of captions and the audio track [7], [12]. This is due to the fact that movie and news images have an ancillary function with respect to words, and video content is strongly related to the textual and audio information carried by the accompanying audio track. Also, speaker identity is important information that can effectively index the movie content [13].

1.3 Sports Video

Event indexing and highlight extraction in sports video have been actively studied recently due to their tremendous commercial applications. Many researchers have studied the respective roles of the visual, audio, and textual modes in this domain. For the visual mode, for example, Kawashima et al. [14] extracted bat-swing features from the video signal. Xie et al. [15] and Xu et al. [16] segmented soccer videos into play and break segments using dominant color and motion information. Gong et al. [17] targeted the parsing of TV soccer programs: by detecting and tracking the soccer court, ball, players, and motion vectors, they were able to distinguish nine different positions of the play (e.g., mid-field, top-right corner of the court, etc.). Ekin and Tekalp [18] analyzed soccer video based on video shot detection and classification. For the audio mode, Rui et al. [19] detected the announcer's excited speech and ball-bat impact sound in baseball games using directional audio template matching. For the textual mode, Babaguchi et al. [20] looked for time spans in which events are likely to take place through extraction of keywords from the closed-captioning stream; their method has been applied to index events in American football video. Since the content of sports video is intrinsically multimodal, many researchers have also proposed information fusion schemes to combine the different modes. A good reference is Snoek and Worring [21].

1.4 An Analogy

How does a reader efficiently access the content of a 1000-page book? Without reading the whole book, he can first go to the table of contents (ToC) to find which chapters or sections suit his needs. If he has specific questions (queries) in mind, such as finding a term, he can go to the index at the end of the book and find the corresponding sections addressing that question. On the other hand, how does a reader efficiently access the content of a 100-page magazine? Without reading the whole magazine, he can either go directly to the featured articles listed on the front page or use the ToC to find which article suits his needs. In short, the book's ToC helps a reader browse, and the book's index helps a reader retrieve; similarly, the magazine's featured articles help the reader browse through the highlights. All three aspects are equally important in helping users access the content of the book or the magazine. For today's video content, techniques are urgently needed for automatically (or semi-automatically) constructing video ToCs, video highlights, and video indices to facilitate summarization, browsing, and retrieval.

2. Terminology

Before we go into the details of the discussion, it will be beneficial to first introduce some important terms used in the digital video research field.

Scripted/unscripted content: A video that is carefully produced according to a script or plan and is later edited, compiled, and distributed for consumption is referred to as scripted content. News videos, dramas, and movies are examples of scripted content. Video content that is not scripted is referred to as unscripted. In unscripted content, such as sports video and meeting video, the events happen spontaneously. One can think of varying degrees of "scripted-ness" and "unscripted-ness" ranging from movie content to surveillance content.

Video shot: a consecutive sequence of frames recorded from a single camera. It is the building block of video streams.

Key frame: the frame that represents the salient visual content of a shot. Depending on the complexity of the content of the shot, one or more key frames can be extracted.

Video scene: a collection of semantically related and temporally adjacent shots, depicting and conveying a high-level concept or story. While shots are marked by physical boundaries, scenes are marked by semantic boundaries.


Video group: an intermediate entity between the physical shots and the semantic scenes that serves as the bridge between the two. Examples of groups are temporally adjacent shots [22] or visually similar shots [3].

Play and break: the first level of semantic segmentation in sports video and surveillance video. In sports video (e.g., soccer, baseball, golf), a game is in play when the ball is in the field and the game is going on; break, or out of play, is the complement, i.e., whenever "the ball has completely crossed the goal line or touch line, whether on the ground or in the air" or "the game has been halted by the referee" [23]. In surveillance video, a play is a period in which there is some activity in the scene.

Audio marker: a contiguous sequence of audio frames representing a key audio class that is indicative of the events of interest in the video. An example of an audio marker for sports video is the audience reaction sound (cheering and applause) or the commentator's excited speech.

Video marker: a contiguous sequence of video frames containing a key video object that is indicative of the events of interest in the video. An example of a video marker for baseball video is the video segment containing the squatting catcher at the beginning of every pitch.

Highlight candidate: a video segment that is likely to be remarkable and can be identified using the video and audio markers.

Highlight group: a cluster of highlight candidates.

In summary, scripted video data can be structured into a hierarchy of five levels: video, scene, group, shot, and key frame, which increase in granularity from top to bottom [4] (see Fig. 2). Similarly, unscripted video data can be structured into a hierarchy of four levels: play/break, audio-visual markers, highlight candidates, and highlight groups, which increase in semantic level from bottom to top (see Fig. 3).
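These hierarchies can also be viewed as nested containers. The following sketch is an illustrative data model of the five-level scripted hierarchy only; the class and field names are ours and are not taken from any of the cited systems.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:                  # building block: frames from a single camera take
    start_frame: int
    end_frame: int
    key_frames: List[int] = field(default_factory=list)   # representative frames

@dataclass
class Group:                 # temporally adjacent or visually similar shots
    shots: List[Shot] = field(default_factory=list)

@dataclass
class Scene:                 # semantically related, temporally adjacent shots
    groups: List[Group] = field(default_factory=list)

@dataclass
class Video:                 # top level of the scripted hierarchy (Fig. 2)
    scenes: List[Scene] = field(default_factory=list)

# The unscripted hierarchy (Fig. 3) can be modelled analogously with play/break
# segments, marker runs, highlight candidates, and highlight groups.
clip = Video(scenes=[Scene(groups=[Group(shots=[Shot(0, 120, key_frames=[60])])])])
```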

Figure 2. A hierarchical video representation for scripted content

Figure 3. A hierarchical video representation for unscripted content

3. Video Representation and Indexing

Considering that each video frame is a 2D object and the temporal axis makes up the third dimension, a video stream spans a 3D space. Video representation is the mapping from this 3D space to the 2D view screen. Different mapping functions characterize different video representation techniques.

3.1 Video of Meetings

As an example of video representation and indexing for meeting videos, we use the user interface shown in Fig. 4. A low-resolution version of the RingCam panorama image is shown in the lower part of the client. A high-resolution image of the speaker is shown in the upper left; it can be selected either automatically by the virtual director or manually by the user (by clicking within the panoramic image).


Figure 4. Meeting browsing client: panorama window (bottom), speaker window (upper left), whiteboard (upper right), timeline (bottom).

The timeline, shown at the bottom of the window, displays the results of speaker segmentation. The speakers are automatically segmented, and each is assigned a unique color. The person IDs have been manually assigned, though this process could be automated by voice identification. The remote viewer can select which person to view by clicking on that person's color. The speakers can also be filtered, so that playback skips past all speakers not selected. The playback control section to the left of the panorama allows the remote viewer to seek to the next or previous speaker during playback. In the following, we describe the major modules of the system in more detail.

Figure 5. RingCam: an inexpensive omnidirectional camera and microphone array designed for capturing meetings.

3.1.1 Sound Source Localization

At the base of the RingCam (see Figure 5) is an 8-element microphone array used for beamforming and sound source localization. The microphone array has an integrated preamplifier and uses an external 1394 A/D converter (MOTU 828) to transmit 8 audio channels at 16-bit, 44.1 kHz to the meeting room server via a 1394 bus. The goal of sound source localization (SSL) is to detect which meeting participant is talking. For more on the RingCam, please see [24].
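The SSL algorithm itself is detailed in [24]; as a rough illustration of how a microphone array can tell who is talking, the sketch below estimates the time difference of arrival between one pair of microphones with GCC-PHAT, a standard technique used here purely for illustration and not necessarily the RingCam's actual method. Mapping the pairwise delays of all eight microphones to an azimuth is then a lookup against the array geometry.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=0.001):
    """Time difference of arrival (seconds) between two microphone channels."""
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n=n), np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)        # phase-transform weighting
    max_shift = min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs        # >0: sound reached mic y first

# Toy check: a noise burst that reaches microphone B 0.25 ms before microphone A.
fs = 44100
src = np.random.default_rng(0).standard_normal(fs // 10)
delay = int(0.00025 * fs)
mic_a, mic_b = src[:-delay], src[delay:]
print(f"estimated delay: {gcc_phat(mic_a, mic_b, fs) * 1e3:.3f} ms")
```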

3.1.2 Person Detection and Tracking

Although audio-based SSL can detect who is talking, its spatial resolution is not high enough to finely steer a virtual camera view. In addition, it can occasionally lose track due to room noise, reverberation, or multiple people speaking at the same time. Vision-based person tracking is a natural complement to SSL: though it does not know who is talking, it has higher spatial resolution and tracks multiple people at the same time. However, robust vision-based multi-person tracking is a challenging task, even after years of research in the computer vision community. The difficulties come from the requirement of being fully automatic and robust to many potential uncertainties. After careful evaluation of existing techniques, we implemented a fully automatic tracking system by integrating three important modules: auto-initialization, multi-cue tracking, and hierarchical verification [25].

Working together, these three modules achieve good tracking performance in real-world environments. A tracking example is shown in Fig. 4, with white boxes around the tracked persons' faces.

3.1.3 Sensor Fusion and Virtual Director

The responsibilities of the virtual director (VD) module are twofold. At the lower level, it conducts sensor fusion to gather and analyze reports from the SSL and multi-person tracking modules and makes intelligent decisions on what the speaker window (the top-left window in Fig. 4) should show. We use particle filters (PF) to obtain the speaker location. The proposal function of the PF is obtained from the individual audio/video sensors, i.e., the person tracking module and the SSL module. Particles are then drawn from the proposal function, weighted, and propagated through time.
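For illustration only, the following sketch performs one bootstrap-style particle filter step over a one-dimensional speaker azimuth, with the proposal drawn around the two sensor reports; it is a deliberate simplification of the fusion algorithm of [26], and the noise levels are made-up numbers.

```python
import numpy as np

rng = np.random.default_rng(1)

def fuse_step(particles, ssl_angle, track_angle,
              ssl_std=15.0, track_std=5.0, drift_std=3.0):
    """One simplified particle-filter step over speaker azimuth (degrees)."""
    n = len(particles)
    # Proposal: draw particles around the two sensor reports (audio SSL is noisier
    # than the visual tracker), plus a small drift to model speaker motion.
    proposal = np.where(rng.random(n) < 0.5,
                        rng.normal(ssl_angle, ssl_std, n),
                        rng.normal(track_angle, track_std, n))
    particles = proposal + rng.normal(0.0, drift_std, n)
    # Weight each particle by how well it explains both observations, then resample.
    w = (np.exp(-0.5 * ((particles - ssl_angle) / ssl_std) ** 2) *
         np.exp(-0.5 * ((particles - track_angle) / track_std) ** 2))
    w /= w.sum()
    return particles[rng.choice(n, size=n, p=w)]

particles = rng.uniform(0, 360, 500)                   # unknown initial direction
particles = fuse_step(particles, ssl_angle=92.0, track_angle=88.0)
print(f"fused speaker azimuth: {particles.mean():.1f} degrees")
```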


For a detailed description of the PF-based sensor fusion algorithm, readers are referred to [26]. At the higher level, just like video directors in real life, the VD module observes the rules of cinematography and video editing in order to make the recording more informative and entertaining [24]. For example, when a person is talking, the VD should promptly show that person. If two people are talking back and forth, instead of switching between the two speakers, the VD may decide to show them together side by side (note that our system captures the entire 360° view). Another important rule is that the camera should not switch too often; otherwise it may distract viewers.

3.1.4 Speaker Segmentation and Clustering

For archived meetings, an important value-added feature is speaker clustering. If a timeline can be generated showing when each person talked during the meeting, it allows users to jump between interesting points, listen to a particular participant, and better understand the dynamics of the meeting. The input to this preprocessing module is the output of the SSL, and the output of this module is the timeline clusters. There are two components in this system: pre-filtering and clustering. During pre-filtering, the noisy SSL output is filtered and outliers are thrown away. During clustering, K-means clustering is used for the first few iterations to bootstrap, and a mixture-of-Gaussians clustering is then used to refine the result. An example timeline cluster is shown in the lower portion of Fig. 4.
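A rough sketch of this pre-filter, K-means bootstrap, and mixture-of-Gaussians refinement pipeline is given below, assuming for simplicity that the per-frame SSL output is a single azimuth estimate; the outlier threshold and the toy data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def cluster_speakers(ssl_angles, n_speakers):
    """Group per-frame sound-source directions (degrees) into speaker labels."""
    x = np.asarray(ssl_angles, dtype=float).reshape(-1, 1)
    # Pre-filtering: discard frames far from the overall median (noisy SSL outliers).
    clean = x[np.abs(x - np.median(x)) < 90.0].reshape(-1, 1)
    # Bootstrap with a few K-means iterations ...
    km = KMeans(n_clusters=n_speakers, n_init=1, max_iter=5, random_state=0).fit(clean)
    # ... then refine with a mixture of Gaussians initialised from the K-means centers.
    gmm = GaussianMixture(n_components=n_speakers,
                          means_init=km.cluster_centers_, random_state=0).fit(clean)
    return gmm.predict(x)          # one speaker label per frame, for the timeline

# Toy timeline: two speakers sitting around 80 and 200 degrees.
rng = np.random.default_rng(0)
angles = np.concatenate([rng.normal(80, 5, 300), rng.normal(200, 5, 300)])
print(cluster_speakers(angles, n_speakers=2)[:5])
```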

3.2 Movies and Broadcast News

Using an analysis framework that can detect shots, key frames, and scenes, it is possible to come up with the following representations for scripted content such as movies and broadcast news [8]. An example is shown in Fig. 6.

Figure 6. An example of a video ToC

3.2.1 Representation based on Sequential Key Frames

After obtaining shots and key frames, an obvious and simple video representation is to lay out the key frames of the video sequentially, from top to bottom and from left to right. This simple technique works well when there are few key frames. When the video clip is long, it does not scale, since it does not capture the information embedded within the video clip, except for time.

3.2.2 Representation based on Groups

To obtain a more meaningful video representation when the video is long, related shots are merged into groups [22], [3]. In [22], Zhang et al. divide the entire video stream into multiple video segments, each of which contains an equal number of consecutive shots. Each segment is further divided into sub-segments, thus constructing a tree-structured video representation. In [3], Zhong et al. proposed a cluster-based video hierarchy, in which shots are clustered based on their visual content. This method again constructs a tree-structured video representation.

3.2.3 Representation based on Scenes

To provide the user with better access to the video, the construction of a video representation at the semantic level is needed [4], [2]. It is not uncommon for a modern movie to contain a few thousand shots and key frames. This is evidenced in [27]: there are 300 shots in a 15-minute segment of the movie "Terminator 2: Judgment Day", and the movie lasts 139 minutes.


Because of the large number of key frames, a simple 1D sequential presentation of key frames for the underlying video (or even a tree-structured layout at the group level) is almost meaningless. More importantly, people watch a video by its semantic scenes rather than by its physical shots or key frames. While a shot is the building block of a video, it is a scene that conveys the semantic meaning of the video to the viewers. The discontinuity of shots is overwhelmed by the continuity of a scene [2]. Video ToC construction at the scene level is thus of fundamental importance to video browsing and retrieval. In [2], a scene transition graph (STG) representation of video is proposed and constructed. The video sequence is first segmented into shots. Shots are then clustered using time-constrained clustering, and the STG is constructed based on the time flow of the clusters.
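As a small illustration (not the exact construction of [2]), the sketch below builds an STG from time-ordered shot cluster labels, i.e., the output of the time-constrained clustering step: nodes are shot clusters, and a directed edge records that a shot of one cluster is immediately followed by a shot of another. Cutting the weak edges between strongly connected groups of nodes is then one way to expose scene boundaries.

```python
from collections import defaultdict

def scene_transition_graph(shot_labels):
    """Build an STG from cluster labels given in temporal (shot) order."""
    edges = defaultdict(set)
    for a, b in zip(shot_labels, shot_labels[1:]):
        if a != b:
            edges[a].add(b)          # cluster a is followed at least once by cluster b
    return dict(edges)

# A dialogue alternating between two shot clusters, then a new scene.
labels = [0, 1, 0, 1, 0, 2, 3, 2, 3]
print(scene_transition_graph(labels))   # {0: {1, 2}, 1: {0}, 2: {3}, 3: {2}}
```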

3.2.4 Representation based on Video Mosaics

Instead of representing the video structure with the video-scene-group-shot-frame hierarchy discussed above, this approach takes a different perspective [28]. The mixed information within a shot is decomposed into three components:

• Extended spatial information: this captures the appearance of the entire background imaged in the shot, and is represented in the form of a few mosaic images.

• Extended temporal information: this captures the motion of independently moving objects in the form of their trajectories.

• Geometric information: this captures the geometric transformations that are induced by the motion of the camera.

3.3 Sports Video

Highlights extraction from unscripted content requires a different representation from the one that supports browsing of scripted content. This is because shot detection is known to be unreliable for unscripted content. For example, in soccer video, visual features are so similar over a long period of time that almost all the frames within it may be grouped into a single shot. However, there might be multiple semantic units within the same period, such as attacks on the goal, counter-attacks in the mid-field, etc. Furthermore, the representation of unscripted content should emphasize the detection of remarkable events to support highlights extraction, whereas the representation of scripted content does not fully support the notion of an event being remarkable compared to others. An example of detected highlights from a baseball video is shown in Fig. 7. For unscripted content, using an analysis framework that can detect plays and specific audio and visual markers, it is possible to come up with the following representations.

Figure 7. Highlight segments in a baseball video

3.3.1 Representation based on Play/Break Segmentation

As mentioned earlier, play/break segmentation using low-level features gives a segmentation of the content at the lowest semantic level. By representing each detected play segment with a key frame, one can enable the end user to select just the play segments.
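The sketch below illustrates one simple way to turn a per-frame dominant-color ratio (e.g., the fraction of green field pixels in soccer) into play segments; the threshold and smoothing window are illustrative, and systems such as [15], [16] combine this cue with motion features and learned models.

```python
import numpy as np

def play_segments(dominant_color_ratio, threshold=0.5, min_len=30):
    """Return (start_frame, end_frame) ranges labelled as play."""
    play = np.asarray(dominant_color_ratio) > threshold
    # Median-filter the labels so brief close-ups do not fragment a play segment.
    k = min_len | 1                                      # odd window length
    pad = np.pad(play.astype(int), k // 2, mode="edge")
    smooth = np.array([np.median(pad[i:i + k]) for i in range(len(play))]) > 0.5
    # Convert the boolean sequence into (start, end) frame ranges.
    segments, start = [], None
    for i, p in enumerate(smooth):
        if p and start is None:
            start = i
        elif not p and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(smooth)))
    return segments

ratio = [0.9] * 50 + [0.2] * 40 + [0.85] * 60            # toy per-frame field ratio
print(play_segments(ratio))                              # [(0, 50), (90, 150)]
```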

3.3.2 Representation based on Audio-Visual Markers

The detection of audio and visual markers enables a representation that is at a higher semantic level than the play/break representation. Since the detected markers are indicative of the events of interest, the user can use both of them to browse the content based on this representation.

3.3.3 Representation based on Highlight Candidates


Associating an audio marker with a video marker enables the detection of highlight candidates that are at a higher semantic level. Such a fusion of complementary cues from audio and video helps eliminate false alarms in either of the marker detectors. Segments in the vicinity of a video marker and an associated audio marker give the end user access to the highlight candidates. For instance, if the baseball catcher pose (visual marker) is associated with an audience reaction segment (audio marker) that follows it closely, the corresponding segment is highly likely to be remarkable or interesting.

3.3.4 Representation based on Highlight Groups

Grouping highlight candidates gives a finer-resolution representation of the highlight candidates. For example, golf swings and putts share the same audio markers (audience applause and cheering) and visual markers (golfers bending to hit the ball). A representation based on highlight groups supports the task of retrieving finer events such as "golf swings only" or "golf putts only".

4. Sports Video Highlights Extraction

In this section, we describe our proposed approach for highlights extraction from "unscripted" content. We show the framework's effectiveness in three different sports, namely soccer, baseball, and golf. Our proposed framework is summarized in Fig. 8. There are four major components in Fig. 8; we describe them one by one in the following.

Figure 8. Proposed approach for sports highlights extraction

4.1 Audio Marker Detection

Broadcast sports content usually includes audience reactions to the interesting moments of the games. Audience reaction classes including applause, cheering, and the commentator's excited speech can serve as audio markers. Fig. 9 shows a unified audio marker detection framework for sports highlights extraction.

Figure 9. Audio markers for sports highlights extraction
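The classifier behind Fig. 9 is not spelled out here; the sketch below assumes per-second blocks of audio features (e.g., MFCCs computed elsewhere) and fits one Gaussian mixture per class, which is one common choice rather than the exact detector used in our system. Contiguous seconds labelled with an audience-reaction class become audio-marker runs.

```python
from sklearn.mixture import GaussianMixture

def train_audio_models(features_by_class, n_components=4):
    """Fit one Gaussian mixture per audio class.
    features_by_class: dict of class name -> (num_frames, num_features) array."""
    return {c: GaussianMixture(n_components, random_state=0).fit(x)
            for c, x in features_by_class.items()}

def audio_marker_runs(models, second_blocks,
                      marker_classes=("applause", "cheering", "excited_speech")):
    """Label each one-second feature block and return contiguous marker spans."""
    labels = [max(models, key=lambda c: models[c].score(block))
              for block in second_blocks]
    spans, start = [], None
    for i, lab in enumerate(labels + ["<end>"]):         # sentinel closes the last run
        if lab in marker_classes and start is None:
            start = i
        elif lab not in marker_classes and start is not None:
            spans.append((start, i))                     # [start, i) in seconds
            start = None
    return spans
```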

Fig. 10 shows examples of visual markers for three different games. For baseball, we want to detect the pattern in which the catcher squats waiting for the pitcher to pitch the ball; for golf, we want to detect players bending to hit the golf ball; for soccer, we want to detect the appearance of the goal post. Correct detection of these key visual objects can eliminate the majority of the video content that is not in the vicinity of the interesting segments. Toward the goal of one general framework for all three sports, we use the following processing strategy: for unknown sports content, we detect whether there are baseball catchers, golfers bending to hit the ball, or soccer goal posts. The detection results enable us to decide which sport (baseball, golf, or soccer) it is.


Figure 10. Examples of visual markers for different sports

An object detection algorithm such as Viola and Jones's [29] has been used to detect these visual markers. We can compare the number of detections with those in the ground-truth set (marked by human viewers). We show the precision-recall curve for baseball catcher detection in Fig. 11. We have achieved, for example, 80% precision for a recall of 70%.
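A sketch of how such a detector can be wired up with OpenCV's cascade implementation is given below; the cascade file name is hypothetical (a catcher-pose cascade would have to be trained offline on labelled frames), and the precision/recall computation mirrors the frame-level comparison against human-marked ground truth.

```python
import cv2

# Hypothetical cascade trained offline on catcher-pose examples (not shipped with OpenCV).
catcher_cascade = cv2.CascadeClassifier("baseball_catcher_cascade.xml")

def has_visual_marker(frame_bgr):
    """True if the frame contains the visual marker (here, a squatting catcher)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    hits = catcher_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(hits) > 0

def precision_recall(detected_frames, ground_truth_frames):
    """Frame-level precision and recall; both arguments are sets of frame ids."""
    tp = len(detected_frames & ground_truth_frames)
    precision = tp / len(detected_frames) if detected_frames else 0.0
    recall = tp / len(ground_truth_frames) if ground_truth_frames else 0.0
    return precision, recall
```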

Figure 11. The precision-recall curve of baseball catcher detection

4.2 Audio-Visual Markers Negotiation for Highlights Candidates Generation

Ideally, each visual marker can be associated with one and only one audio marker and vice versa; together they make a pair of audio-visual markers indicating the occurrence of a highlight event in their vicinity. But since many pairs might be wrongly grouped due to false detections and misses, some post-processing is needed to keep the error to a minimum. An algorithm might be as follows (a sketch in code follows the list):

• If a contiguous sequence of visual markers overlaps with a contiguous sequence of audio markers by a large margin (e.g., the percentage of overlap is greater than 50%), then we form a "highlight" segment spanning from the beginning of the visual marker sequence to the end of the audio marker sequence.

• Otherwise, associate a visual marker sequence with the nearest audio marker sequence that follows it, if the duration between the two is less than a duration threshold (e.g., the average duration of a set of training "highlight" clips from baseball games).
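A minimal sketch of these two rules is given below, treating each contiguous marker run as a (start, end) interval in seconds; the overlap is measured against the shorter of the two runs, and max_gap stands in for the duration threshold learned from training clips.

```python
def negotiate_markers(video_runs, audio_runs, min_overlap=0.5, max_gap=10.0):
    """Pair visual-marker runs with audio-marker runs into highlight candidates.
    video_runs, audio_runs: time-sorted lists of (start, end) intervals in seconds."""
    candidates = []
    for vs, ve in video_runs:
        for as_, ae in audio_runs:
            overlap = max(0.0, min(ve, ae) - max(vs, as_))
            shorter = max(min(ve - vs, ae - as_), 1e-9)
            if overlap / shorter >= min_overlap:         # rule 1: large overlap
                candidates.append((vs, ae))
                break
            if as_ >= ve and as_ - ve <= max_gap:        # rule 2: nearest following run
                candidates.append((vs, ae))
                break
    return candidates

# Toy example: a catcher appearance at 100-104 s followed by cheering at 106-112 s.
print(negotiate_markers([(100, 104)], [(106, 112)]))     # [(100, 112)]
```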

4.3 Finer-Resolution Highlights Recognition and Verification

Highlight candidates, delimited by the audio markers and visual markers, are quite diverse. For example, golf swings and putts share the same audio markers (audience applause and cheering) and visual markers (golfers bending to hit the ball). Both of these golf highlight events can be found by the aforementioned audio-visual marker detection method. To support the task of retrieving finer events such as "golf swings only" or "golf putts only", we have developed techniques that model these events using low-level audio-visual features. Furthermore, some of these candidates might not be true highlights. We eliminate such false candidates using a finer-level highlight classification method. For example, for golf, we build models for golf swings, golf putts, and non-highlights (neither swings nor putts) and use these models for highlights classification (swings or putts) and verification (highlights or non-highlights). As an example, let us look at finer-level highlight classification for a baseball game using low-level color features. The diverse baseball highlight candidates found after the audio and visual marker negotiation step are further separated using the techniques described here. For baseball, there are two major categories of highlight candidates.


The first is "balls or strikes", in which the batter does not hit the ball; the second is "ball-hits", in which the ball is hit into the field or toward the audience. These two categories have different color patterns. In the first category, the camera is fixed on the pitch scene, so the variance of the color distribution over time is low. In the second category, in contrast, the camera first shoots the pitch scene and then follows the ball to the field or the audience, so the variance of the color distribution over time is higher.
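The color-variation cue can be computed directly from frame histograms; the sketch below is an illustrative version in which the decision threshold would, in practice, be learned from labelled candidates.

```python
import numpy as np

def color_variation(frames_rgb, bins=8):
    """Mean frame-to-frame change of the color histogram over a candidate segment.
    frames_rgb: sequence of HxWx3 uint8 frames."""
    hists = []
    for frame in frames_rgb:
        h, _ = np.histogramdd(frame.reshape(-1, 3),
                              bins=(bins, bins, bins), range=((0, 256),) * 3)
        hists.append(h.ravel() / h.sum())
    hists = np.array(hists)
    return np.abs(np.diff(hists, axis=0)).sum(axis=1).mean()

def classify_candidate(frames_rgb, threshold=0.15):
    """Low variation: fixed pitch-scene camera ('ball or strike'); high: camera follows a hit."""
    return "ball-hit" if color_variation(frames_rgb) > threshold else "ball-or-strike"
```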

5. Video Summarization and Browsing

In video summarization, what "essence" the summary should capture depends on whether the content is scripted or not. Since scripted content, such as news, dramas, and movies, is carefully structured as a sequence of semantic units, one can get its essence by enabling a traversal through representative items from these semantic units. Hence, Table of Contents (ToC) based video browsing caters to summarization of scripted content. For instance, a news video composed of a sequence of stories can be summarized/browsed using a key-frame representation for each of the shots in a story. However, summarization of unscripted content, such as meetings and sports, requires a "highlights" extraction framework that captures only the remarkable events that constitute the summary.

5.1 ToC Based Browsing

For "Representation based on Sequential Key Frames", browsing is sequential, scanning from the top-left key frame to the bottom-right key frame. For "Representation based on Groups", hierarchical browsing is supported [22], [3]. At the coarse level, only the main themes are displayed. Once the user determines which theme he is interested in, he can then go to the finer level of that theme. This refinement process can go on until the leaf level. For the STG representation, a major characteristic is the indication of time flow embedded within the representation. By following the time flow, the viewer can browse through the video clip.

5.2 Highlights Based Browsing

For "Representation based on Play/Break Segmentation", browsing is also sequential, enabling a scan of all the play segments from the beginning of the video to the end. "Representation based on Audio-Visual Markers" supports queries such as "find me video segments that contain the soccer goal post in the left-half field", "find me video segments that have the audience applause sound", or "find me video segments that contain the squatting baseball catcher". "Representation based on Highlight Candidates" supports queries such as "find me video segments where a golfer has a good hit" or "find me video segments where there is a soccer goal attempt". Note that "a golfer has a good hit" is represented by the detection of the golfer hitting the ball followed by the detection of applause from the audience. Similarly, "there is a soccer goal attempt" is represented by the detection of the soccer goal post followed by the detection of long and loud audience cheering. "Representation based on Highlight Groups" supports more detailed queries than the previous representation, such as "find me video segments where a golfer has a good swing", "find me video segments where a golfer has a good putt", or "find me video segments where there is a good soccer corner kick".

6. Challenging Problems in Video Retrieval

In this paper we have reviewed some of the important issues in semantic video retrieval. Now we step back to look at the broad field of content-based video retrieval and try to ascertain: what are the most challenging research problems facing us?

1) Bridging the semantic gap

Perhaps the most desirable mode of image/video retrieval is still by keywords or phrases. However, manually annotating image and video data is extremely tedious. To do annotation automatically or semi-automatically, we need to bridge the "semantic gap", i.e., to find algorithms that infer high-level semantic concepts (sites, objects, events) from low-level image/video features that can be easily extracted from the data (color, texture, shape and structure, layout; motion; audio: pitch, energy, etc.).

One sub-problem is Audio Scene Analysis. Researchers have worked on Visual Scene Analysis (Computer Vision) for many years, but Audio Scene Analysis is still in its infancy and remains an under-explored field.


Another sub-problem is multimodal fusion, esp. how to combine visual and audio cues to bridge the semantic gap in video.

2) How to best combine human intelligence and machine intelligence

One advantage of information retrieval is that in most scenarios there is a human (or humans) in the loop. One prominent example of human-computer interaction is Relevance Feedback.

3) New Query Paradigms

For image/video retrieval, people have tried query by keywords, similarity, sketching an object, sketching a trajectory, painting a rough image, etc. Can we think of useful new paradigms?

4) Video Data Mining

Searching for interesting/unusual patterns and correlations in video has many important applications, including Web search engines and the analysis of intelligence data. Work to date on data mining has been mainly on text data.

5) How To Use Unlabeled Data?

We can consider Relevance Feedback as a two-category classification problem (relevant or irrelevant). However, the number of training samples is very small. Can we use the large number of unlabeled samples in the database to help? Also, how about active learning (choosing the best samples to return to the user so as to obtain the most information about the user's intention through feedback)?

Another problem related to image/video data annotation is Label Propagation. Can we label a small set of data and let the labels propagate to the unlabeled samples?

6) Incremental Learning

In most applications, we keep adding new data to the database. We should be able to change the parameters of the retrieval algorithms incrementally, without needing to start from scratch every time we have new data.

7) Using Virtual Reality Visualization To Help

Can we use 3D audio/visual visualization techniques to help a user navigate through the data space to browse and to retrieve?

8) Structuring Very Large Databases

Researchers in audio/visual scene analysis and those in databases and information retrieval should collaborate closely to find good ways of structuring very large video databases for efficient retrieval and search.

9) Performance Evaluation

How do we compare the performance of different retrieval algorithms?

10) What Are the Killer Applications of Video Retrieval?

Few real applications of video retrieval have been accepted by the general public so far. Is the Web video search engine going to be the next killer application? It remains to be seen. With no clear answer to this question, it is still a challenge to do research that is appropriate for real applications.

7. Acknowledgement

We thank Dr. Regunathan Radhakrishnan and Dr. Ajay Divakaran of Mitsubishi Electric Research Labs, Cambridge, MA, for their contributions and helpful discussions.

8. REFERENCES

[1] H. Zhang, A. Kankanhalli, and S. W. Smoliar, "Automatic partitioning of full-motion video," ACM Multimedia Systems, vol. 1, no. 1, pp. 1-12, 1993.
[2] R. M. Bolle, B.-L. Yeo, and M. M. Yeung, "Video query: Beyond the keywords," Technical Report, IBM Research, Oct. 17, 1996.
[3] D. Zhong, H. Zhang, and S.-F. Chang, "Clustering methods for video browsing and annotation," Tech. Rep., Columbia University, 1997.
[4] Y. Rui, T. S. Huang, and S. Mehrotra, "Exploring video structures beyond the shots," in Proc. IEEE Conf. on Multimedia Computing and Systems, 1998.


[5] M. Bertini, A. D. Bimbo, and P. Pala, "Content based indexing and retrieval of TV news," Pattern Recognition Letters, vol. 22, pp. 503-516, 2001.
[6] D. Swanberg, C. Shu, and R. Jain, "Knowledge guided parsing in video databases," SPIE 1908 (13), pp. 13-24, 1993.
[7] K. Ohtsuki, K. Bessho, Y. Matsuo, S. Matsunaga, and Y. Hayashi, "Automatic indexing of broadcast news by combining audio, speech and visual information," in this issue.
[8] C. Cotsaces, N. Nikolaidis, and I. Pitas, "Video shot boundary detection and condensed representation: A review," in this issue.
[9] A. Hanjalic, "Extracting moods from pictures and sounds: Towards truly personalized TV," in this issue.
[10] X. Wu, C.-W. Ngo, and Q. Li, "Threading and auto-documentary in news videos," in this issue.
[11] A. Kokaram, N. Rea, R. Dahyot, M. Tekalp, P. Bouthemy, P. Gros, and I. Sezan, "Browsing sports video," in this issue.
[12] A. G. Hauptmann and M. J. Witbrock, "Informedia: News-on-demand multimedia information acquisition and retrieval," Intelligent Multimedia Information Retrieval, pp. 213-239, 1997.
[13] Y. Li, S. Narayanan, and C.-C. J. Kuo, "Content-based movie analysis and indexing based on audio visual cues," IEEE Trans. Circuits and Systems for Video Technology, vol. 14, no. 8, August 2004.
[14] T. Kawashima, K. Tateyama, T. Iijima, and Y. Aoki, "Indexing of baseball telecast for content-based video retrieval," in Proc. International Conference on Image Processing, 1998, pp. 871-874.
[15] L. Xie, S.-F. Chang, A. Divakaran, and H. Sun, "Structure analysis of soccer video with hidden Markov models," in Proc. International Conference on Acoustics, Speech, and Signal Processing, vol. 4, May 2002, pp. 4096-4099.
[16] P. Xu et al., "Algorithms and system for segmentation and structure analysis in soccer video," in Proc. IEEE Conference on Multimedia and Expo, Aug. 2001, pp. 928-931.
[17] Y. Gong, L. T. Sin, C. H. Chuan, H. Zhang, and M. Sakauchi, "Automatic parsing of TV soccer programs," in Proc. IEEE International Conference on Multimedia Computing and Systems, 1995, pp. 167-174.
[18] A. Ekin and A. M. Tekalp, "Automatic soccer video analysis and summarization," in Proc. International Conference on Electronic Imaging: Storage and Retrieval for Media Databases, 2003, pp. 339-350.
[19] Y. Rui, A. Gupta, and A. Acero, "Automatically extracting highlights for TV baseball programs," in Proc. Eighth ACM International Conference on Multimedia, 2000, pp. 105-115.
[20] N. Babaguchi, Y. Kawai, and T. Kitahashi, "Event-based indexing of broadcasted sports video by intermodal collaboration," IEEE Transactions on Multimedia, vol. 4, no. 1, pp. 68-75, March 2002.
[21] C. Snoek and M. Worring, "Multimodal video indexing: A review of the state-of-the-art," Technical Report 2001-20, Intelligent Sensory Information Systems Group, University of Amsterdam, 2001.
[22] H. Zhang, S. W. Smoliar, and J. J. Wu, "Content-based video browsing tools," in Proc. IS&T/SPIE Conf. on Multimedia Computing and Networking, 1995.
[23] L. Xie, S.-F. Chang, A. Divakaran, and H. Sun, "Structure analysis of soccer video with domain knowledge and hidden Markov models," Pattern Recognition Letters, 2004, to appear.
[24] R. Cutler, Y. Rui, et al., "Distributed meetings: A meeting capture and broadcasting system," in Proc. ACM Multimedia, 2002.
[25] Y. Chen, Y. Rui, and T. Huang, "JPDAF based HMM for real-time contour tracking," in Proc. IEEE CVPR, pp. I-543-550, Kauai, Hawaii, December 11-13, 2001.
[26] Y. Chen and Y. Rui, "Real-time speaker tracking using particle filter sensor fusion," Proceedings of the IEEE, vol. 92, no. 3, pp. 485-494, Mar. 2004.
[27] M. Yeung, B.-L. Yeo, and B. Liu, "Extracting story units from long programs for video browsing and navigation," in Proc. IEEE Conf. on Multimedia Computing and Systems, 1996.
[28] M. Irani and P. Anandan, "Video indexing based on mosaic representations," Proceedings of the IEEE, vol. 86, pp. 905-921, May 1998.


[29] P. Viola and M. Jones, "Robust real-time object detection," in Second International Workshop on Statistical and Computational Theories of Vision: Modeling, Learning, Computing and Sampling, Vancouver, Canada, July 2001.

Ziyou Xiong received his BS degree from Wuhan University, Hubei Province, China, in July 1997. He received his MS degree in electrical and computer engineering from the University of Wisconsin, Madison, in December 1999 and his PhD degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign (UIUC) in October 2004. He was also a research assistant with the Image Formation and Processing Group of the Beckman Institute for Advanced Science and Technology at UIUC from January 2000 to August 2004. In the summers of 2003 and 2004, he worked on sports audio-visual analysis at Mitsubishi Electric Research Labs, Cambridge, Massachusetts. Since September 2004, he has been with the Dynamic Modeling and Analysis group at the United Technologies Research Center in East Hartford, Connecticut, as a senior researcher/scientist. His current research interests include image and video analysis, video surveillance, computational audio-visual scene analysis, pattern recognition, machine learning, and related applications. He has published several journal and conference papers and invited book chapters on audio-visual person recognition, image retrieval, and sports video indexing, retrieval, and highlight extraction. He has also coauthored a book titled Facial Analysis from Continuous Video with Application to Human-Computer Interface (Kluwer, 2004).

Xiang Zhou received the PhD degree in electrical engineering from the University of Illinois at Urbana-Champaign (UIUC) in 2002. Previously, he received the bachelor's degree in automation and in economics and management (minor) and studied economics for two years in a PhD program at Tsinghua University, China. Since 2002, he has been with Siemens Corporate Research in Princeton, New Jersey. His research interests include computer vision and machine learning; object detection, tracking, and recognition; and multimedia analysis, representation, understanding, and retrieval. He was the program co-chair for the 2003 International Conference on Image and Video Retrieval. He is the coauthor of the book Exploration of Visual Data (Kluwer, 2003). He was the recipient of eight scholarships and awards from Tsinghua University from 1988 to 1995. In 2001, he received the M. E. Van Valkenburg Fellowship Award, given to one or two PhD students in the ECE department of UIUC each year "for demonstrated excellence in research in the areas of circuits, systems, or computers." He is a member of the IEEE and the IEEE Computer Society.

Qi Tian received his PhD degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign (UIUC) in 2002, his MS degree in electrical and computer engineering from Drexel University, Philadelphia, Pennsylvania, in 1996, and his BE degree in electronic engineering from Tsinghua University, China, in 1992. He has been an Assistant Professor in the Department of Computer Science at the University of Texas at San Antonio (UTSA) since 2002 and an Adjunct Assistant Professor in the Department of Radiation Oncology at the University of Texas Health Science Center at San Antonio (UTHSCSA) since 2003. He has published over 50 refereed book chapters, journal, and conference papers in these fields. He has served as co-chair of the ACM Multimedia Workshop on Multimedia Information Retrieval (2005) and of SPIE Internet Multimedia Management Systems (2005), and as session chair and PC member of a number of international conferences in computer vision and multimedia such as ICME, ICPR, CIVR, MIR, and VCIP. He is a Senior Member of the IEEE and a Member of the ACM.

Yong Rui is a Researcher in the Communication and Collaboration Systems (CCS) group at Microsoft Research, where he leads the Multimedia Collaboration team. Dr. Rui is a Senior Member of the IEEE and a Member of the ACM. He is on the editorial board of the International Journal of Multimedia Tools and Applications. He received his PhD from the University of Illinois at Urbana-Champaign (UIUC). Dr. Rui's research interests include computer vision, signal processing, machine learning, and their applications in communication, collaboration, and multimedia systems. He has published one book (Exploration of Visual Data, Kluwer Academic Publishers), six book chapters, and over sixty refereed journal and conference papers in the above areas.


Dr. Rui was on the program committees of ACM Multimedia, IEEE CVPR, IEEE ECCV, IEEE ACCV, IEEE ICIP, IEEE ICASSP, IEEE ICME, SPIE ITCom, ICPR, and CIVR, among others. He was a co-chair of the IEEE International Workshop on Multimedia Technologies in E-Learning and Collaboration (WOMTEC) 2003, the demo chair of ACM Multimedia 2003, and a co-tutorial chair of ACM Multimedia 2004. He has served on NSF review panels and participated in the National Academy of Engineering's Symposium on Frontiers of Engineering for outstanding researchers.

Thomas S. Huang received his B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, China, and his M.S. and Sc.D. degrees in electrical engineering from the Massachusetts Institute of Technology, Cambridge, Massachusetts. He was on the faculty of the Department of Electrical Engineering at MIT from 1963 to 1973, and on the faculty of the School of Electrical Engineering and Director of its Laboratory for Information and Signal Processing at Purdue University from 1973 to 1980. In 1980, he joined the University of Illinois at Urbana-Champaign, where he is now William L. Everitt Distinguished Professor of Electrical and Computer Engineering, Research Professor at the Coordinated Science Laboratory, Head of the Image Formation and Processing Group at the Beckman Institute for Advanced Science and Technology, and Co-Chair of the Institute's major research theme Human Computer Intelligent Interaction. Dr. Huang's professional interests lie in the broad area of information technology, especially the transmission and processing of multidimensional signals. He has published 14 books and over 500 papers in network theory, digital filtering, image processing, and computer vision. He is a Member of the National Academy of Engineering, a Foreign Member of the Chinese Academies of Engineering and Sciences, and a Fellow of the International Association for Pattern Recognition, the IEEE, and the Optical Society of America. He has received a Guggenheim Fellowship, an A.V. Humboldt Foundation Senior U.S. Scientist Award, and a Fellowship from the Japan Society for the Promotion of Science. He received the IEEE Signal Processing Society's Technical Achievement Award in 1987 and the Society Award in 1991. He was awarded the IEEE Third Millennium Medal in 2000. Also in 2000, he received the Honda Lifetime Achievement Award for "contributions to motion analysis." In 2001, he received the IEEE Jack S. Kilby Medal. In 2002, he received the King-Sun Fu Prize of the International Association for Pattern Recognition and the Pan Wen-Yuan Outstanding Research Award. He is a Founding Editor of the international journal Computer Vision, Graphics, and Image Processing and Editor of the Springer Series in Information Sciences, published by Springer-Verlag.