

Video Object Annotation, Navigation, and Composition

Dan B Goldman 1, Chris Gonterman 2, Brian Curless 2, David Salesin 1,2, Steven M. Seitz 2

1 Adobe Systems, Inc., 801 N. 34th Street, Seattle, WA 98103
{dgoldman,salesin}@adobe.com

2 Computer Science & Engineering, University of Washington, Seattle, WA 98105-4615
{gontech, curless, salesin, seitz}@cs.washington.edu

ABSTRACT

We explore the use of tracked 2D object motion to enable novel approaches to interacting with video. These include moving annotations, video navigation by direct manipulation of objects, and creating an image composite from multiple video frames. Features in the video are automatically tracked and grouped in an off-line preprocess that enables later interactive manipulation. Examples of annotations include speech and thought balloons, video graffiti, path arrows, video hyperlinks, and schematic storyboards. We also demonstrate a direct-manipulation interface for random frame access using spatial constraints, and a drag-and-drop interface for assembling still images from videos. Taken together, our tools can be employed in a variety of applications including film and video editing, visual tagging, and authoring rich media such as hyperlinked video.

ACM Classification: H5.2 [Information interfaces and presentation]: User Interfaces - Interaction styles.

General terms: Design, Human Factors, Algorithms

Keywords: Video annotation, video navigation, video interaction, direct manipulation.

1 INTRODUCTION

Present video interfaces demand that users think in terms of frames and timelines. Although frames and timelines are useful notions for many editing operations, they are less well suited for other types of interactions with video. In many cases, users are likely to be more interested in the higher-level components of a video such as motion, action, character, and story. For example, consider the problems of attaching a moving annotation to an object in a video, or finding a moment in a video in which an object is in a particular place, or of composing a still from multiple moments in a video. Each of these tasks revolves around objects in the video and their motion. But although it is possible to compute object boundaries and to track object motion, present interfaces do not utilize this information for interaction.

In this paper we propose a framework using (2D) video object motion to enable novel approaches to user interaction.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
UIST'08, October 19–22, 2008, Monterey, CA, USA.

Copyright 2008 ACM 978-1-59593-975-3/08/10 ...$5.00.

Although the concept of object motion is not as high-level as character and story, it is a mid-level aspect of video that we believe is a crucial building block toward those higher-level concepts. We apply computer vision techniques to understand how various points move about the scene and segment this motion into independently moving objects. This information is used to simplify the interaction required to achieve a variety of video manipulation tasks.

In particular, we propose novel interfaces for three tasks that are under-served by present-day video interfaces: annotation, navigation, and image composition.

Video annotation. Video annotation is the task of associating graphical objects with moving objects on the screen. In existing interactive applications, only still images can be annotated, as in the "telestrator" system used in American football broadcasting (see Figure 1). Using our system, however, such annotations can easily be attached to moving objects in the scene by novices with minimal user effort. Our system supports descriptive labels, illustrative sketches, thought and word bubbles communicating speech or intent, path arrows indicating motion, and hyperlinked regions (Figures 3–7). Given the pre-computed object motion, the system can also determine when objects are occluded or change appearance significantly, and modify the appearance of the annotations accordingly (Figure 8). We envision video object annotations being used in any field in which video is produced or used to communicate information.

Video navigation. The use of motion analysis also permits novel approaches to navigating video through direct manipulation. Typical approaches to navigating video utilize a linear scan metaphor, such as a slider, timeline, or fast-forward and rewind controls. However, using pre-computed object motion, the trajectories of objects in the scene can be employed as constraints for direct-manipulation navigation: In our system, clicking and dragging on objects in a video frame causes the video to immediately advance or rewind to a frame in which the object is located close to the ending mouse position. Contemporaneous work [7, 11] shows that this approach streamlines video tasks requiring selection of individual frames (for example, locating a "cut frame" on which to begin or end a shot). Our system also enables a novel mode of re-animation of video sequences using the mouse as input: We can direct the motion of an object such as a face by dragging it around the screen. This interface can be applied to "puppeteer" existing video, or to retime the playback of a video.

Video-to-still composition. Finally, we discuss the task of rearranging portions of a video to create a new still image. Agarwala et al. [1] have previously explored the problem of combining a set of stills into a single, seamless composite. However, when the source material is video, the task of identifying the appropriate moments becomes a problem in and of itself. Many consumer cameras now support a "burst mode" in which a high-speed series of high-resolution photos is taken. Finding Cartier-Bresson's "decisive moment" in such an image stream, or composing a photomontage from multiple moments, is a significant challenge. Rav-Acha et al. [19] have proposed editing videos using evolution of time-fronts, but have not presented a user interface for doing so. We demonstrate an interactive interface complementary to such algorithms, for composition of novel still images using a drag-and-drop metaphor to manipulate people or other objects in a video.

To achieve these interactions, our system first analyzes the video in a fully automatic preprocessing step that tracks the motion of image points across the video and segments those tracks into coherently moving groups. Although reliably extracting objects in video and tracking them over many frames is a hard problem in computer vision, the manipulations we support do not require perfect object segmentation and tracking, and can instead exploit low-level motion tracking and mid-level grouping information. Furthermore, aggregating point motion into coherent groups has a number of benefits that are critical for interaction: We can select a moving region using a single click, estimate object motion more robustly, and, to a limited extent, handle changes of object appearance, including occlusions. Our contributions include an automated preprocess for video interaction that greatly enriches the space of interactions possible with video, an intuitive interface for creating graphical annotations that transform along with associated video objects, a fluid interaction technique for scrubbing through video with single or multiple constraints, and a novel drag-and-drop approach to composing video input into new images by combining regions from multiple frames.

1.1 Related work

Figure 1: A typical telestrator illustration. (©Johntex, CC license [30])

The telestrator [30], popularly known as a "John Madden-style whiteboard," was invented by physicist Leonard Reiffel for drawing annotations on a TV screen using a light pen. This approach has also been adopted for individual sports instruction using systems like ASTAR [3] that aid coaches in reviewing videos of athletic performance. However, as previously mentioned, annotations created using a telestrator are typically static, and do not overlay well on moving footage.

In recent years, broadcasts of many professional sporting events have utilized systems supplied by Sportvision [28] to overlay graphical information on the field of play even while the camera is moving. Sportvision uses a variety of technologies to accomplish these overlays, including surveying, calibration, and instrumentation of the field of play or racing environment. Although this instrumentation and calibration allow graphics to be overlaid in real time during the broadcast, it requires expensive specialized systems for each different class of sporting event, and is not applicable to pre-existing video acquired under unknown conditions.

Tracking has previously been used for video manipulation and authoring animations. For example, Agarwala et al. [2] demonstrated that an interactive keyframe-based contour tracking system could be used for video manipulation and stroke animation authoring. However, their system required considerable user intervention to perform tracking. In contrast, our application does not require pixel-accurate tracking or object segmentation, so we can use more fully-automated techniques that do not produce pixel segmentations.

Our method utilizes the particle video approach of Sand and Teller [23] to densely track points in the video. Object tracking is a widely researched topic in computer vision, and many other tracking approaches are possible; Yilmaz et al. [31] recently surveyed the state of the art. However, particle video is especially well suited to interactive video applications because it provides a dense field of tracked points that can track fairly small objects, and even points in featureless regions. An important advantage of the particle video approach over other methods is that it produces tracks that are both spatially dense and temporally long-range.

Our grouping preprocess accomplishes some of the same goals as the object grouping technique of Sivic et al. [26], which tracks features using affine-covariant feature matching and template tracking, followed by a grouping method employing co-occurrence of tracks in motion groups. That method has shown significant success at grouping different views of the same object even through deformations and significant lighting changes. However, after some experimentation we found that it has several drawbacks for our application, which we discuss in depth in Section 6.

Balakrishnan and Ramos [18] also developed a system with a novel navigation and annotation interface to video. However, their approach is based on a linear timeline model, and annotations apply only to individual frames in a video.

Thought and speech balloons have previously been employed in virtual worlds and chat rooms [15, 13], in which the associated regions are known a priori. Kurlander et al. [13] specifically address the problem of balloon layout. However, their system was free to assign the placement of both the word balloons and the subjects speaking them, whereas our system is designed to associate thought and speech balloons with arbitrary moving video objects.

We are not the first to propose the notion of hyperlinked video as described in Section 4.1. To our knowledge, the earliest reference of this is the Hypersoap project [6]. However, the authoring tool proposed in that work required extensive user annotation of many frames. Smith et al. [27] taxonomized hyperlinks in dynamic media, but their implementation is limited to simple template tracking of whole objects. We believe our system offers a significantly improved authoring environment for this type of rich media.


Numerous previous works discuss composition of a still or a short video from a longer video, for the purposes of moving texture synthesis [25], modifying dialogue [4], animating video sprites [24], or video summarization [19, 20, 16, 17]. In this paper we advocate and demonstrate the approach of drag-and-drop manipulation as a tool for interactively directing such compositions.

Our system features a novel interface for scrubbing through video using direct manipulation of video objects. This technique is similar in spirit to the storyboard-based scrubbing approach of Goldman et al. [8], but permits manipulation directly on the video frame, rather than on an auxiliary storyboard image. The system of Goldman et al. required several minutes of user interaction to author a storyboard before this manipulation could be performed. In contrast, our approach requires only automated preprocessing to enable direct-manipulation navigation. Indeed, we propose that our navigation system can be used as part of an authoring system to create such storyboards, as described in Section 4.3. Our video navigation mechanism can be used to re-animate video of a person, in much the same way as the spacetime faces system [32], but without requiring a complex 3D shape acquisition system.

Kimber et al. [12] introduced the notion of navigating a video by directly manipulating objects within a video frame. They also demonstrate tracking and navigation across multiple cameras. However, their method uses static surveillance cameras and relies on whole object tracking, precluding navigation on multiple points of a deforming object such as we demonstrate in Figure 10.

Contemporaneous works by Karrer et al. [11] and Dragicevic et al. [7] use flow-based preprocessing to enable real-time interaction that is very similar to our approach described in Section 4.2. Their research demonstrates that direct manipulation video browsing permits significant performance improvements for frame selection tasks.

However, our work advances this concept in five important ways: First and foremost, this paper presents a general framework enabling a number of different kinds of direct video manipulations, not only navigation. Second, our preprocessing method achieves the long-range accuracy of object-tracking [12] and feature-tracking methods [7], while also retaining the spatial resolution provided by optical-flow-based techniques [11]. Our use of motion grouping also improves the robustness of the system to partial occlusions, and makes it possible to track points well even near the boundaries of objects, points which might otherwise "slip" off of one object onto another at some frame, causing an incorrect trajectory. Third, our "starburst" manipulator widget effectively represents the space of both simple and complex object motions, whereas the motion path arrows employed in earlier work are most effective at representing simple, smooth motion paths. The starburst widget is more effective in situations such as Figure 10, in which it is more important to convey the range of feasible motion than an object's specific path through the video (in this case, a spiral). Fourth, our introduction of multiple constraints enables simple access to more elaborate object configurations, also as shown in Figure 10. Finally, we introduce an inertial slider mechanism allowing access to frames in which an object is off-screen.

In some respects, the approaches of Karrer et al. and Dragicevic et al. have advantages over our framework. Most notably, both methods use a less expensive preprocessing approach than ours, potentially making them more widely applicable in practice. Both methods also include terms in their cost functions that discourage discontinuous jumps in time, which may be desirable for certain applications.

1.2 Overview

Our system consists of several off-line preprocessing stages (Section 2), followed by an interactive interface for video selection that is common to all our techniques (Section 3). The subsequent section describes applications: our interfaces for video annotation (Section 4.1), navigation (Section 4.2), and recomposition (Section 4.3).

2 PRE-PROCESSING

Our preprocessing consists of two phases. First, point particles are placed and tracked over time (Section 2.1). Second, the system aggregates particles into consistent moving groups (Section 2.2).

2.1 Particle tracking

To track particles, we apply the "particle video" long-range point tracking method [23, 22], which takes as input a sequence of frames and produces as output a dense cloud of particles, representing the motion of individual points in the scene throughout their range of visibility. Each particle has a starting and ending frame, and a 2D position for each frame within that range.

The key advantage of the particle video approach over either template tracking or optical flow alone is that it is both spatially dense and temporally long-range. In contrast, feature tracking is long-range but spatially sparse, and optical flow is dense but temporally short-range. Thus, particle video data is ideal for our applications, as we can approximate the motion of any pixel into any other frame by finding a nearby particle.

In the sections that follow, we will use the following notation: A particle track i is represented by a 2D position x_i(t) at each time t during its lifetime t ∈ T(i). The total number of particles is denoted n.
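For concreteness, the per-particle output of the tracking preprocess can be pictured as a small record like the sketch below; the field names and array layout are illustrative assumptions, not details of the particle video implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Particle:
    """One tracked point: a 2D position for every frame of its lifetime T(i)."""
    start_frame: int          # first frame on which the particle exists
    positions: np.ndarray     # shape (lifetime, 2); positions[t - start_frame] = x_i(t)

    def frames(self):
        """The lifetime T(i): frames on which the particle is visible."""
        return range(self.start_frame, self.start_frame + len(self.positions))

    def pos(self, t: int) -> np.ndarray:
        """x_i(t): the particle's 2D position on frame t (t must lie in T(i))."""
        return self.positions[t - self.start_frame]
```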

2.2 Particle grouping

For certain annotation applications, we find it useful to estimate groups of points that move together over time. Our system estimates these groupings using a generative K-affines motion model, in which the motion of each particle i is generated by one of K affine motions plus isotropic Gaussian noise:

x_i(t + Δt) = A_{L(i)}(t) [x_i(t)] + η        (1)

Here A_k(t) represents the affine motion of group k from time t to time t + Δt, and η is zero-mean isotropic noise with standard deviation σ. (For notational convenience we write A_k(t) as a general operator, rather than separating out the linear matrix multiplication and addition components.) In our system, Δt = 3 frames. Each particle is assigned a single group label 1 ≤ L(i) ≤ K over its entire lifetime.


The labels L(i) are distributed with unknown probability P[L(i) = k] = π_k. We denote group k as G_k = {i | L(i) = k}.

Our system optimizes for the maximum likelihood model

Θ = (A_1, ..., A_K, π_1, ..., π_K, L(1), ..., L(n))        (2)

using an EM-style alternating optimization. Given the above generative model, the energy function Q can be computed as:

Q(Θ) = Σ_i Σ_{t ∈ T(i)} ( d(i,t) / (2σ²) − log π_{L(i)} )        (3)

where d(i,t) = ‖x_i(t + Δt) − A_{L(i)}(t)[x_i(t)]‖², the residual squared error.

To compute the labels L(i), we begin by initializing them to random integers between 1 and K. Then, the following steps are iterated until Q converges:

• Affine motions A_k are estimated by a least-squares fit of an affine motion to the particles in G_k.

• Group probabilities π_k are computed as the number of particles in each group, weighted by particle lifespan, then normalized such that Σ_k π_k = 1.

• Labels L(i) are reassigned to the label that minimizes the objective function Q(Θ) per particle, and the groups G_k are updated.

In the present algorithm we fix σ = 1 pixel, but this could be included as a variable in the optimization as well.
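As a rough illustration of this alternating optimization (not the authors' implementation), the following sketch assumes each track is stored as a frame-to-position dictionary and runs a fixed number of iterations in place of an explicit convergence test on Q.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine map: returns A (2x3) with dst ~= A @ [x, y, 1]^T."""
    X = np.hstack([src, np.ones((len(src), 1))])       # (m, 3)
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)        # (3, 2)
    return A.T                                         # (2, 3)

def apply_affine(A, pts):
    return np.hstack([pts, np.ones((len(pts), 1))]) @ A.T

def group_particles(tracks, K, dt=3, sigma=1.0, n_iters=30, seed=0):
    """Alternating-optimization sketch of Section 2.2.
    tracks[i] is a dict {frame t: 2-vector position} for particle i.
    Returns an array of labels L(i) in {0, ..., K-1}."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(tracks))
    all_frames = sorted({t for tr in tracks for t in tr})

    def obs(i, t):
        """(x_i(t), x_i(t+dt)) if both frames lie in the particle's lifetime."""
        tr = tracks[i]
        return (np.asarray(tr[t]), np.asarray(tr[t + dt])) if t in tr and t + dt in tr else None

    for _ in range(n_iters):
        # 1. Fit an affine motion A_k(t) to the particles currently labeled k.
        A = {}
        for k in range(K):
            members = [i for i in range(len(tracks)) if labels[i] == k]
            for t in all_frames:
                pairs = [o for o in (obs(i, t) for i in members) if o is not None]
                if len(pairs) >= 3:                    # an affine fit needs >= 3 points
                    src, dst = (np.array(p) for p in zip(*pairs))
                    A[(k, t)] = fit_affine(src, dst)
        # 2. Group priors pi_k: particle counts weighted by lifespan, normalized.
        pi = np.array([sum(len(tracks[i]) for i in range(len(tracks)) if labels[i] == k)
                       for k in range(K)], dtype=float) + 1e-9
        pi /= pi.sum()
        # 3. Reassign each particle to the label that minimizes its share of Q (Eq. 3).
        for i in range(len(tracks)):
            cost = np.zeros(K)
            for k in range(K):
                for t in tracks[i]:
                    o = obs(i, t)
                    if o is None or (k, t) not in A:
                        continue
                    d = np.sum((apply_affine(A[(k, t)], o[0][None])[0] - o[1]) ** 2)
                    cost[k] += d / (2 * sigma ** 2) - np.log(pi[k])
            labels[i] = int(np.argmin(cost))
    return labels
```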

The output of the algorithm is a segmentation of the particles into K groups. Figure 2 illustrates this grouping on one of our input datasets. Although there are some misclassified particles, the bulk of the particles are properly grouped. Our interactive selection interface, described in Section 3, can be used to overcome the minor misclassifications seen here.

Figure 2: Four frames from a video sequence, with particles colored according to affine groupings computed in Section 2.2. (video footage ©2005 Jon Goldman)

3 VIDEO SELECTION

Several of our interactions require the selection of an object in the video, either for annotation or direct manipulation of that object. The object specification task is closely related to the problems of interactive video segmentation and video matting [2, 5, 14, 29]. However, these other methods are principally concerned with recovering a precise matte or silhouette of the object being selected. In contrast, the goal of our system is to recover the motion of a particular object or region of the image, without concern for the precise location of its boundaries.

In our system, the user selects video objects simply by painting a stroke or dragging a rectangle over a region R_s of the image. This region is called the anchor region, and the frame on which it is drawn is called the anchor frame, denoted t_s. The anchor region defines a set of anchor tracks. For some applications, it suffices to define the anchor tracks M(s) as the set of all particles on the anchor frame that lie within the anchor region:

M(s) = { i | t_s ∈ T(i), x_i(t_s) ∈ R_s }        (4)

However, this simplistic approach to selecting anchor tracks requires the user to scribble over a potentially large anchor region. The system can reduce the amount of user effort by employing the particle groupings computed in Section 2.2. Our interface uses the group labels of the particles in the anchor region to infer entire group selections, rather than individual particle selections. To this end, the system supports two modes of object selection. First, the user clicks once to select the group of points of which the closest track is a member. The closest track i* to point x_0 on frame t_0 is located as:

i*(x_0, t_0) = argmin_{i | t_0 ∈ T(i)} ‖x_0 − x_i(t_0)‖,        (5)

and the selected group is simply G_{L(i*(x_0, t_0))}. Second, the user paints a "sloppy" selection that includes points from multiple groups. The resulting selection consists of the groups that are well represented in the anchor region. Each group is scored according to the number |M_k(s)| of its particles in the anchor region s. Then any group whose score is a significant fraction T_G of the highest-scoring group is accepted:

M_k(s) = G_k ∩ M(s)  for all 1 ≤ k ≤ K        (6)

S_k(s) = |M_k(s)| / max_{1 ≤ k' ≤ K} |M_{k'}(s)|        (7)

G(s) = ∪_{k | S_k(s) ≥ T_G} G_k        (8)

The threshold T_G is a system constant that controls the selection precision. When T_G = 1, only the highest-scoring group argmax_k |M_k(s)| is selected. As T_G approaches 0, any group with a particle in the selected region will be selected in its entirety. We have found that T_G = 0.5 generally gives intuitive results.
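A minimal sketch of this scoring rule, under the assumption that particle positions are stored per frame and the painted region is available as a point-membership test, might look like the following.

```python
import numpy as np

def select_groups(positions, labels, anchor_mask, t_s, T_G=0.5):
    """Sketch of the sloppy-selection rule of Section 3 (Eqs. 4-8).
    positions: dict mapping particle index i -> {frame: (x, y)}
    labels:    array of group labels L(i)
    anchor_mask(x, y) -> bool: membership test for the painted anchor region R_s
    t_s: anchor frame. Returns the set of selected group labels G(s)."""
    # M(s): particles alive on the anchor frame whose position lies in R_s (Eq. 4).
    anchor = [i for i, tr in positions.items()
              if t_s in tr and anchor_mask(*tr[t_s])]
    if not anchor:
        return set()
    # |M_k(s)|: how many anchor particles each group contributes (Eq. 6).
    groups, counts = np.unique([labels[i] for i in anchor], return_counts=True)
    scores = counts / counts.max()                   # S_k(s), Eq. 7
    # Accept every group whose score clears the threshold T_G (Eq. 8).
    return {int(k) for k, s in zip(groups, scores) if s >= T_G}
```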

Our system's affine grouping mechanism may group particles together that are spatially discontiguous.


However, discontiguous regions are not always desired for annotation or manipulation. To address this, only the particles that are spatially contiguous to the anchor region are selected. This effect is achieved using a connected-components search through a pre-computed Delaunay triangulation of the particles on the anchor frame.

Both the group-scoring formula and the connected-components search take only a fraction of a second, and can therefore be recomputed on the fly while a paint stroke is in progress in order to give visual feedback to the user about the regions being selected.
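The contiguity filter can be sketched as a breadth-first search over Delaunay edges; the SciPy triangulation and the optional edge-length cap below are our own assumptions rather than details of the paper's implementation.

```python
import numpy as np
from scipy.spatial import Delaunay
from collections import deque

def contiguous_selection(points, selected, seeds, max_edge=None):
    """Sketch: keep only selected particles spatially contiguous to the anchor.
    points:   (n, 2) particle positions on the anchor frame
    selected: boolean mask of particles whose groups passed the score threshold
    seeds:    indices of particles inside the painted anchor region
    max_edge: optional cap on edge length, to avoid bridging across the image."""
    tri = Delaunay(points)
    indptr, indices = tri.vertex_neighbor_vertices    # adjacency in CSR form
    keep = np.zeros(len(points), dtype=bool)
    queue = deque(i for i in seeds if selected[i])
    keep[list(queue)] = True
    while queue:                                      # BFS over Delaunay edges
        i = queue.popleft()
        for j in indices[indptr[i]:indptr[i + 1]]:
            if keep[j] or not selected[j]:
                continue
            if max_edge is not None and np.linalg.norm(points[i] - points[j]) > max_edge:
                continue
            keep[j] = True
            queue.append(j)
    return keep
```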

By allowing more than one group to be selected, the user can easily correct the case of over-segmentation, in which connected objects with slightly different motion may have been placed in separate groups. In the case of annotation, if a user selects groups that move independently, the attached annotations will simply be a "best fit" to both motions.

Using groups of particles confers several advantages over independent particles. As previously mentioned, user interaction is streamlined by employing a single click or a loose selection to indicate a moving object with complex shape and trajectory. Furthermore, the large number of particles in the object groupings can be used to compute more robust motion estimates for rigidly moving objects. The system can also annotate objects even on frames where the original anchor tracks M(s) no longer exist due to partial occlusions or deformations. (However, our method is not robust to the case in which an entire group is occluded and later becomes visible again. This is a topic for future work.)

4 APPLICATIONS

Our interactive interface is patterned after typical drawing and painting applications, with the addition of a video timeline. The user is presented with a video window, in which the video can be scrubbed back and forth using either a slider or a direct manipulation interface described in Section 4.2. A toolbox provides access to a number of different types of annotations and interaction tool modes, which are applied using direct manipulation in the video window.

4.1 Video annotations

Telestrator-like markup of video can be useful not only for sports broadcasting but also for medical applications, surveillance video, and instructional video. Film and video professionals can use annotations to communicate editorial information about footage in post-production, such as objects to be eliminated or augmented with visual effects. Annotations can also be used to modify existing video footage for entertainment purposes with speech and thought balloons, virtual graffiti, "pop-up video" notes, and other arbitrary signage.

In this section, we describe our system's support for five types of graphical annotations, distinguished by their shape and by the type of transformations (e.g., translation / homography / similarity) with which they follow the scene. An annotation's motion and appearance is determined by the motion of its anchor tracks: Transformations between the anchor frame and other frames are computed using point correspondences between the features on each frame. Some transformations require a minimum number of correspondences, so if there are too few correspondences on a given frame — for example because the entire group is occluded — the annotation is not shown on that frame.

At present, we have implemented prototype versions of "graffiti," "scribbles," "speech balloons," "path arrows," and "hyperlinks."

Figure 3: Graffiti

Graffiti. These annotations inherit a perspective deformation from their anchor tracks, as if they are painted on a planar surface such as a wall or ground plane. Given four or more non-collinear point correspondences, a homography is computed using the normalized direct linear transformation method described by Hartley and Zisserman [10].

Depending on the length of the video sequence, it may be time-consuming to compute the transformations of an annotation for all frames. Therefore, after the user completes the drawing of the anchor regions, the transformations of graffiti annotations are not computed for all frames immediately, but are lazily evaluated as the user visits other frames. Further improvements to perceived interaction speed are possible by performing the computations during idle callbacks between user interactions.
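A lazy evaluation scheme of this kind might be sketched as follows; the track-position accessor is a hypothetical interface, and OpenCV's homography estimator stands in for the normalized DLT the paper cites.

```python
import numpy as np
import cv2

class GraffitiAnnotation:
    """Sketch of lazy per-frame homography evaluation for a 'graffiti' annotation.
    anchor_pts: (n, 2) anchor-track positions on the anchor frame (n >= 4)
    track_pts:  callable(frame) -> (n, 2) positions of the same tracks on `frame`,
                with rows set to NaN where a track is not alive (assumed interface)."""
    def __init__(self, anchor_pts, track_pts):
        self.anchor_pts = np.asarray(anchor_pts, dtype=np.float32)
        self.track_pts = track_pts
        self.cache = {}                      # frame -> 3x3 homography, or None

    def homography(self, frame):
        if frame not in self.cache:          # computed only when the frame is visited
            dst = np.asarray(self.track_pts(frame), dtype=np.float32)
            ok = ~np.isnan(dst).any(axis=1)
            if ok.sum() < 4:                 # too few correspondences: hide the annotation
                self.cache[frame] = None
            else:
                H, _ = cv2.findHomography(self.anchor_pts[ok], dst[ok], 0)
                self.cache[frame] = H
        return self.cache[frame]
```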

Figure 4: Scribbles (footage ©2005 Jon Goldman)

Scribbles. These simple typed or sketched annotations just translate along with the mean translation of the anchor tracks. This annotation is ideal for simple communicative tasks, such as local or remote discussions between collaborators in film and video production.

Figure 5: Word balloons (footage ©2005 Jon Goldman)

Word balloons. Inspired by Comic Chat [13] and Rosten et al. [21], our system implements speech and thought balloons that reside at a fixed location on the screen, with a "tail" that follows the annotated object. The location of the speech balloon is optimized to avoid overlap with foreground objects and other speech balloons, while remaining close to the anchor tracks.

Path arrows. These annotations highlight a particular moving object, displaying an arrow indicating its motion onto a plane in the scene. To compute the arrow path we transform the motion of the centroid of the anchor tracks into the coordinate system of the background group in each frame. The resulting path is used to draw an arrow that transforms along with the background.


Figure 6: Path arrows

By clipping the path to the current frame's edges, the arrowhead and tail can remain visible on every frame, making it easy to determine the subject's direction of motion at any time. We believe this type of annotation could be used by surveillance analysts, or to enhance telestrator-style markup of sporting events.

Video hyperlinks. Our system can also be used to author dynamic regions of a video that respond to interactive mouse movements, enabling the creation of hyperlinked video [6]. A prototype hyper-video player using our system as a front end for annotation is shown in Figure 7. When viewing the video on an appropriate device, the user can obtain additional information about annotated objects, for example, obtaining purchase information for clothing, or additional references for historically or scientifically interesting objects. As a hyperlink, this additional information does not obscure the video content under normal viewing conditions, but rather allows the viewer to actively choose to obtain further information when desired. The hyperlinked regions in this 30-second segment of video were annotated using our interface in about 5 minutes of user time.

Marking occlusions. When an object being annotated is partially occluded, our system can modify an associated annotation's appearance or location, either to explicitly indicate the occlusion or to move the annotation to an un-occluded region. One indication of occlusion is that the tracked particles in the occluded region are terminated. Although this is a reliable indicator of occlusion, it does not help determine when the same points on the object are disoccluded, since the newly spawned particles in the disoccluded region are not the same as the particles that were terminated when the occlusion occurred.

Figure 7: A video with highlighted hyperlinks to web pages. (video footage ©2005 Jon Goldman)

Figure 8: A rectangle created on the first frame sticks to the background even when its anchor region is partially or completely occluded. The annotation changes from yellow to red to indicate occlusion.

Here again we are aided by the grouping mechanism, since it associates these points on either side of the occlusion as long as there are other particles in the group to bridge the two sets. To determine if a region of the image instantiated at one frame is occluded at some other frame, the system simply computes the fraction of particles in the transformed region that belong to the groups present in the initial frame. An annotation is determined to be occluded if fewer than half of the points in the region belong to the originally-selected groups. Figure 8 shows a rectangular annotation changing color as it is repeatedly occluded and disoccluded.
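The occlusion test reduces to a counting rule; a sketch under assumed data structures:

```python
import numpy as np

def is_occluded(region_contains, frame_positions, frame_labels, original_groups,
                threshold=0.5):
    """Sketch of the occlusion test of Section 4.1.
    region_contains(x, y) -> bool: membership test for the annotation's region,
                                   already transformed into the current frame
    frame_positions: (n, 2) particle positions on the current frame
    frame_labels:    (n,) group label of each of those particles
    original_groups: set of group labels selected on the anchor frame."""
    inside = np.array([region_contains(x, y) for x, y in frame_positions])
    if not inside.any():
        return True                                   # no particles at all: treat as occluded
    ours = np.isin(frame_labels[inside], list(original_groups))
    return ours.mean() < threshold                    # occluded if < half belong to our groups
```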

4.2 Video navigation using direct manipulation

Given densely tracked video, we can scrub to a different frame of the video by directly clicking and dragging on moving objects in the scene. We have implemented two variations on this idea: The first uses individual mouse clicks and drags, and the second uses multiple gestures in sequence.

The single-drag UI is implemented as follows: When the user clicks at location x_0 while on frame t_0, the closest track i* is computed as in Equation (5), and the offset between the mouse position and the track location is retained for future use: d = x_0 − x_{i*}(t_0). Then, as the user drags the mouse to position x_1, the video is scrubbed to the frame t* in which the offset mouse position x_1 + d is closest to the track position on that frame:

t* = argmin_{t ∈ T(i*)} ‖x_1 + d − x_{i*}(t)‖        (9)

Since this action mimics the behavior of a traditional scrollbar or slider, we call it a "virtual slider." Figures 9 and 10(a–c) illustrate this behavior.
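A sketch of the frame lookup in Equation 9, assuming the selected track is stored as a frame-to-position dictionary:

```python
import numpy as np

def virtual_slider_frame(track, click_xy, click_frame, mouse_xy):
    """Sketch of Eq. 9: pick the frame whose track position is closest to the
    offset mouse position.
    track: dict {frame t: np.array([x, y])} for the selected particle i*."""
    d = np.asarray(click_xy) - track[click_frame]       # offset retained at mouse-down
    target = np.asarray(mouse_xy) + d                   # x_1 + d
    frames = list(track)
    dists = [np.linalg.norm(target - track[t]) for t in frames]
    return frames[int(np.argmin(dists))]                # t*
```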


Figure 9: When the mouse is depressed at position x_0 on frame 2, track A (shown in red) is selected as the virtual slider track. When the mouse moves to position x_1, the location of that track at frame 3 is closer to the offset mouse position x_1 + d, so the video is scrubbed to frame 3. Track A ends at frame 5, but the virtual slider is extended to later frames using the next closest track (track B, shown in blue).


This approach can cause discontinuous jumps in time when manipulating objects that travel away from and then back to their original screen locations. Such jumps can be eliminated by computing distances in (x, y, t) [11] or (x, y, arc-length) [7] spaces, as demonstrated in contemporaneous work. However, the discontinuous jumps can actually be preferable in situations such as that shown in Figure 10, in which the user searches for a scene configuration without attending to exactly when in the video that configuration occurred. In such situations, ignoring temporal continuity gives more salient results and more immediate feedback.

Visualizing range of motion. To represent the range of motion afforded by a virtual slider, we can display it as a line or path arrow overlaid on the image. However, for complex movements this representation may be confusing. Therefore, we have developed a novel representation for displaying range of motion: A "starburst" widget is placed at the constraint location such that the lengths of the starburst's spikes represent the spatial proximity of nearby frames on the slider. For simple linear motions the starburst widget simply looks like a compass with one arm pointing to the direction of movement forwards in time and another pointing to the direction of movement backwards in time. However, for more complex motions additional arms begin to appear, indicating that portions of the range of motion lie in other directions.

To display the starburst widget, we first bin the slider positions according to their orientation from the current position x_{i*}(t_0). Then, a weight h_θ is computed for each arm using all the positions in bin B_θ:

v_t = x_{i*}(t) − x_{i*}(t_0)        (10)

h_θ = Σ_{t | x_{i*}(t) ∈ B_θ} exp(−‖v_t‖² / σ_h²)        (11)

The length l_θ of the starburst arm in direction θ is then given by normalizing and scaling the weights:

l_θ = k √( h_θ / Σ_ω h_ω )        (12)

where k controls the absolute scaling of the starburst. Longer arms are given an arrowhead, and bins with no points do not display any arm at all. This formula is designed to weight the track locations by a Gaussian distance falloff, so that nearby points on the slider are weighted more heavily than distant ones. Since the binning described above can cause aliasing for slider paths that lie just along the boundaries of the orientation bins, we adjust the orientation of the bins on each frame using PCA, centering a bin on the dominant direction of the slider in the local region. Examples of the starburst manipulators in context can be seen in Figures 10 and 11.
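The arm lengths can be computed directly from Equations 10–12; in this sketch the bin orientations are fixed rather than re-centered by PCA, and the constants σ_h and k are assumed display parameters.

```python
import numpy as np

def starburst_arms(track, t0, n_bins=8, sigma_h=40.0, k=60.0):
    """Sketch of Eqs. 10-12: arm lengths for the starburst widget.
    track: dict {frame: 2-vector} for the slider track i*; t0: the current frame.
    Returns (bin_center_angles, arm_lengths); bins with no points get length 0."""
    centers = -np.pi + (np.arange(n_bins) + 0.5) * 2 * np.pi / n_bins
    x0 = np.asarray(track[t0])
    v = np.array([np.asarray(track[t]) - x0 for t in track if t != t0])  # v_t, Eq. 10
    if len(v) == 0:
        return centers, np.zeros(n_bins)
    theta = np.arctan2(v[:, 1], v[:, 0])
    bins = np.floor((theta + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    w = np.exp(-np.sum(v ** 2, axis=1) / sigma_h ** 2)          # Gaussian falloff, Eq. 11
    h = np.bincount(bins, weights=w, minlength=n_bins)          # h_theta per orientation bin
    lengths = k * np.sqrt(h / h.sum())                          # Eq. 12
    return centers, lengths
```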

Handling occlusions. Because tracks can start and end on arbitrary frames, it is not always possible to reach a desired frame using a virtual slider constructed from a single track. Therefore we extend the virtual sliders using multiple tracks, as follows: If the last frame of the selected track, t_max = max{t ∈ T(i*)}, is not the end of the video, and the track is not at the edge of the frame, the system finds the next closest track j* on frame t_max in the same group that extends at least to t_max + 1. The portion of this track that extends beyond frame t_max is then appended to the virtual slider, offset by the displacement x_{i*}(t_max) − x_{j*}(t_max) between tracks in order to avoid a discontinuity (see Figure 9). This process is continued forwards until either the end of the video is reached, or the virtual slider reaches the edge of the frame, or the next-closest track is too distant. The process is then repeated in reverse for the first frame of the track, continuing backwards.

In most cases this algorithm successfully extends the virtual slider throughout the visible range of the object's motion, including partial occlusions. However, our system does not track objects through full occlusions, so the virtual slider typically terminates when the object becomes fully occluded. In order to overcome short temporary occlusions, we have added "inertia" to our virtual sliders, inspired by the inertial scrolling feature of Apple's iPhone: During mouse motion we track the temporal velocity of the virtual slider in frames per second. When the mouse button is released, this velocity decays over a few seconds rather than instantaneously. In this way, the user can reach frames just beyond the end of the virtual slider by "flicking" the mouse toward the end of the slider. This feature complements the accurate frame selection mechanism by providing a looser long-distance frame selection mechanism.

Some virtual sliders may have paths that fold back on themselves, or even spiral around, so that a linear motion of the mouse will jump nonlinearly to different frames. In this case, the temporal velocity of the slider when the mouse is released is not meaningful. Instead, we can retain the spatial constraint for several seconds, advancing its location using the spatial velocity of the mouse pointer before the button was released.

Unfortunately, spatial inertia cannot be used in all cases, because unlike temporal inertia, there is no way to reach past the end of the virtual sliders: Tracks do not extend beyond the boundaries of the video frame. Therefore our virtual sliders automatically choose between temporal inertia and spatial inertia by examining the predicted next position of the virtual slider: If they are in similar directions, temporal inertia is used. If they are in different directions, spatial inertia is used. We have found this heuristic tends to perform intuitively in most cases.
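The temporal-inertia variant might be sketched as follows; the half-life constant is an assumed value, and the spatial-inertia mode and the direction-comparison heuristic are omitted.

```python
class InertialSlider:
    """Sketch of temporal inertia for a virtual slider: on release, the slider's
    frame velocity decays over a few seconds instead of stopping instantly."""
    def __init__(self, half_life=0.5):
        self.half_life = half_life      # seconds for the velocity to halve (assumed value)
        self.velocity = 0.0             # frames per second, measured while dragging
        self.frame = 0.0                # fractional current frame

    def on_drag(self, new_frame, dt):
        """Track the temporal velocity of the slider while the mouse moves."""
        if dt > 0:
            self.velocity = (new_frame - self.frame) / dt
        self.frame = float(new_frame)

    def on_idle_after_release(self, dt, n_frames):
        """Advance playback while the released slider coasts; call once per tick."""
        self.velocity *= 0.5 ** (dt / self.half_life)
        self.frame = min(max(self.frame + self.velocity * dt, 0), n_frames - 1)
        return int(round(self.frame))
```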

Multiple constraints. By extending the scrubbing technique described above to multiple particle paths, we also implement a form of constraint-based video control. The user sets multiple point constraints on different parts of the video image, and the video is advanced or rewound to the frame that minimizes the maximum distance from each particle to its constraint position. Here, c indexes over constraint locations x_c, offsets d_c, and constrained particles i*_c, and C is the set of all constraints:

t* = argmin_{t ∈ T(i*)} max_{c ∈ C} ‖x_c + d_c − x_{i*_c}(t)‖        (13)

In our current mouse-based implementation, the user can sequentially set a number of fixed-location constraints and one dynamic constraint, controlled by the mouse. However, multiple dynamic constraints could be applied using a multi-touch input device.



Figure 10: Our scrubbing interface is used to interactively manipulate a video of a moving head by dragging it to the left and right (a–c). Nearby frames are indicated by longer starburst arms. At the extremes of motion (c), some arms of the starburst disappear. Additional constraints are applied to open the mouth (d), or keep the mouth closed and smile (e).

Figures 10(d) and 10(e) illustrate facial animation using multiple constraints.

To balance multiple constraints, we employ a max function in Equation (13) instead of a sum, because this function has optima at which the error for two opposing constraints is equal. As a simple example, consider two objects in a one-dimensional scene that travel away from the origin at two different speeds: x_a(t) = v_a t and x_b(t) = v_b t, with v_a ≠ v_b. Now imagine that these objects are constrained to positions x*_a and x*_b respectively. If we were to use a sum of absolute values |v_a t − x*_a| + |v_b t − x*_b| as our cost function, it would be minimized at either t = x*_a / v_a or t = x*_b / v_b: One of the constraints would be met exactly while the other is ignored! Using the sum of squared errors for our cost function does not solve this problem, as it also meets the constraints unequally. However, by using the maximum distance in our cost function, the error is made equal for the two most competing constraints at the optimum.
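A sketch of the minimax frame search of Equation 13, restricted (as an assumption) to frames on which every constrained track is alive:

```python
import numpy as np

def multi_constraint_frame(constraints):
    """Sketch of Eq. 13: choose the frame minimizing the worst constraint violation.
    constraints: list of (track, offset, target), where track is {frame: np.array([x, y])},
    offset is the d_c retained at mouse-down, and target is the constraint position x_c."""
    # Candidate frames: those on which every constrained track exists (an assumption).
    frames = set.intersection(*[set(tr) for tr, _, _ in constraints])

    def worst(t):
        # max_c ||x_c + d_c - x_{i*_c}(t)||
        return max(np.linalg.norm(np.asarray(xc) + np.asarray(dc) - tr[t])
                   for tr, dc, xc in constraints)

    return min(frames, key=worst)                      # argmin over t
```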

4.3 Video-to-still composition

Another application enabled by our system is the seamless recomposition of a video into a single still image featuring parts of several different frames of the video. For example, one might wish to compose a still image from a short video sequence featuring multiple participants. Although there may be frames of the video in which one or two subjects are in frame, looking at the camera, and smiling, there may be no one frame in which all of the subjects look good. However, by using different frames of the video for different subjects we may be able to compose a single still frame that is better than any individual video frame in the sequence. Prior work [1] assumes that the number of frames is small enough that a user can examine each frame individually to find the best frame for each subject. Such interfaces are therefore less appropriate for video input.

In our system, we take a drag-and-drop approach to video recomposition. Virtual sliders, as described in the previous section, are used to navigate through the video. In addition, we display only the video object under the mouse at the new frame, and the rest of the objects are "frozen" at their previous frame. This is accomplished as follows:

First, if the camera is moving, we estimate the motion of the background as a homography, using the group with the largest number of tracked points to approximate the background. Subsequently, when other frames are displayed, they are registered to the current frame using the estimated background motion, and all virtual slider paths are computed using the registered coordinate frame.¹ As the mouse button is dragged and the frame changes, a rough mask of the object being dragged is computed using the same connected-components search described in Section 3. This mask is used to composite the contents of the changing frame over the previously selected frame. In this way, these regions appear to advance or rewind through time and the object moves away from its previous location. We also composite the new frame contents within the matte of the object at the frame upon which the mouse button was depressed, in order to remove its original appearance from that region of the image. Although this is not guaranteed to show the proper contents of that image region (for example, a different object may have entered that region at the new frame), it is simple enough to run at interactive rates and usually produces plausible results. When the mouse button is released, a final composite is computed using graph cut compositing [1] with the object mask as a seed region, removing most remaining artifacts. An illustration of drag-and-drop video recomposition is shown in Figure 11.
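The interactive preview step can be sketched as two masked copies, assuming the frames have already been registered by the background homography; the final graph cut compositing is not shown.

```python
import numpy as np

def composite_drag_preview(frozen, current, drag_mask, source_mask):
    """Sketch of the interactive preview in Section 4.3: the dragged object is shown
    from the current frame while everything else stays frozen.
    frozen, current: HxWx3 images already registered to a common background
    drag_mask:   HxW bool mask of the object at its position in the current frame
    source_mask: HxW bool mask of the object where the drag started (to erase it)."""
    out = frozen.copy()
    out[source_mask] = current[source_mask]   # remove the object's original appearance
    out[drag_mask] = current[drag_mask]       # paste it at its new-frame position
    return out
```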

Our system also supports another type of still composition using video: the schematic storyboards described by Goldman et al. [8]. In that work, the user selected keyframes from an exhaustive display of all the frames in the video, and manually selected and labeled corresponding points in each of those frames. In contrast, our interface is extremely simple: The user navigates forward and backward in the video using either a standard timeline slider or the "virtual slider" interface, and presses a "hot key" to assign the current frame as a storyboard keyframe. When the first keyframe is selected, the interface enters "storyboard mode," in which the current frame is overlaid atop the storyboard as a ghosted image, in the proper extended frame coordinate system. Changing frames causes this ghosted image to move about the window, panning as the camera moves. The user can add new keyframes at any time, and the layout is automatically recomputed. As described by Goldman et al., the storyboard is automatically split into multiple extended frames as necessary, and arrows can be associated with moving objects using the object selection and annotation mechanisms already described. The resulting interaction is much easier to use than the original storyboard authoring interface of Goldman et al.

5 INFORMAL EVALUATION

We believe the tools we have demonstrated are largely unique to our system. However, it is possible to associate annotations with video objects using some visual effects and motion graphics software.

¹ This is identical to the "relative flow dragging" described by Dragicevic et al. [7].


Figure 11: Drag-and-drop composition of a still from video. The black car is to the right of the white car on all frames of the input video. From left to right, the black car is dragged from its location on frame 1 to a new location on frame 66. The first three panels show the appearance during the drag interaction, and the fourth panel shows the final graph cut composite, which takes about 5 seconds to compute. Unlike a traditional image cut/paste, the black car appears larger as it is dragged leftwards, because it is being extracted from a later video frame.

We asked a novice user of Adobe After Effects – a commercial motion graphics and compositing tool with a tracking module – to create a single translating "scribble" text annotation on the car shown in Figure 8. With the assistance of a slightly more experienced user, the novice spent over 20 minutes preparing this test in After Effects.

However, since After Effects is targeted at professionals, we also asked an expert user – a co-inventor and architect of After Effects – to perform the same annotation task using both After Effects and our "graffiti" tool. We gave him a quick 1-minute introduction to our tool, then asked him to annotate both the car and the crosswalk stripe with a deforming text label, as shown in Figure 3. Using our tool, which he had not previously used, he completed the task in just 30 seconds, using 10 mouse clicks and 12 keystrokes. Using After Effects, the same task took 7 minutes and 51 seconds, using over 74 mouse clicks, 52 click-and-drags, and 38 keystrokes. In a second trial he performed the same task in 3 minutes and 55 seconds, using over 32 mouse clicks, 36 click-and-drags, and 34 keystrokes. Although After Effects offers more options for control of the final output, our system is an order of magnitude faster to use for annotation because we perform tracking beforehand, and we only require the user to gesturally indicate video objects, annotation styles, and contents.

We also assessed the usability and operating range of our system by using it ourselves to create storyboards for every shot in the 9-minute short film "Kind of a Blur" [9]. The main body of the film (not including head and tail credit sequences) was manually segmented into 89 separate shots. Of these, 25 were representable as single-frame storyboards with no annotations. For the remaining 64 shots, we preprocessed progressive 720×480 video frames, originally captured on DV with 4:1:1 chroma compression. Of the 64 shots attempted, 35 resulted in complete and acceptable storyboards. The remaining 29 were not completed satisfactorily for the following reasons (some shots had multiple problems). The particle video algorithm failed significantly on seventeen shots: eleven due to motion blur, three due to large-scale occlusions by foreground objects, and three due to moving objects too small to be properly resolved. Of the remaining twelve shots, the grouping algorithm failed to converge properly for five shots. Six shots included some kind of turning or rotating object, but our system only effectively annotates translating objects. Nine shots were successfully pre-processed, but would require additional types of annotations to be effectively represented using storyboard notation. Although additional work is needed to expand our system's operating range, these results show promise for our approach.

5.1 Limitations

One important limitation of our system is the length of time required to preprocess video. In our current implementation, the preprocess takes up to 5 minutes per frame for 720×480 input video, which is prohibitive for some of the potential applications described here. Although most of the preprocess is highly parallelizable, novel algorithms would be necessary for applications requiring "instant replay."

Many of the constraints on our method's operating range are inherited from the constraints on the underlying particle video method. This approach works best on sequences with large textured objects moving relatively slowly. Small moving objects are hard to resolve, and fast motion introduces motion blur, causing particles to slip across occlusion boundaries. Although the particle video algorithm is relatively robust to small featureless regions, it can hallucinate coherent motion in large featureless regions, which may be interpreted as separate groups in the motion grouping stage.

Another drawback is that when a video features objects with repetitive or small screen-space motions — like a subject moving back and forth along a single path, or moving directly toward the camera — it may be hard or impossible to reach a desired frame using the cost function described in Equation 9. Other cost functions have been proposed to infer the user's intent in such cases [12, 7, 11].

6 DISCUSSION

We have presented a system for interactively associating graphical annotations with independently moving video objects, navigating through video using the screen-space motion of objects in the scene, and composing new still frames from video input using a drag-and-drop metaphor. Our contributions include the application of an automated preprocess for video interaction, a fluid interface for creating graphical annotations that transform along with associated video objects, and novel interaction techniques for scrubbing through video and recomposing frames of video. The assembly of a well-integrated system enabling new approaches to video markup and interaction is, in itself, an important contribution. Our system performs all tracking and grouping offline before the user begins interaction, and our user interface hides the complexity of the algorithms, freeing the user to think in terms of high-level goals such as the placement of objects and the types and contents of their annotations, rather than low-level details of tracking and segmentation.

We believe our preprocessing method is uniquely well-suited to our applications. In particular, the long-range stability of the particle video tracks simplifies the motion grouping algorithm with respect to competing techniques: We had initially implemented the feature-based approach described by Sivic et al. [26], but encountered several important drawbacks of their approach for our interaction methods. First, the field of tracked and grouped points is relatively sparse, especially in featureless areas of the image. This scarcity of features makes them less suitable for the types of interactive applications that we demonstrate. Second, affine-covariant feature regions are sometimes quite large, and may therefore overlap multiple moving objects. (Both of these drawbacks also are problematic in the method of Dragicevic et al. [7].) However, a drawback of our grouping method is that it does require the number of motion groups to be specified a priori.
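
To show where that a priori count enters, the following sketch clusters trajectories into a fixed number of motion groups k. It is only an illustration (a plain k-means over trajectory coordinates, assuming every track spans the same frames), not the grouping algorithm used in our system.

# Illustration only: grouping point trajectories into a fixed number of motion
# groups k with plain k-means.  This stands in for our grouping algorithm just
# to show where the a priori group count enters; it assumes every track spans
# the same set of frames.
import numpy as np

def group_tracks(tracks: np.ndarray, k: int, iters: int = 50, seed: int = 0) -> np.ndarray:
    """Cluster tracks of shape (n_tracks, n_frames, 2); returns a group index per track."""
    rng = np.random.default_rng(seed)
    X = tracks.reshape(len(tracks), -1).astype(np.float64)   # one row per trajectory
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels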

Although they work remarkably well, we do not consider our tracking and grouping algorithms to be a central contribution of this work. However, we find it notable that so much interaction is enabled by such a straightforward preprocess. We have no doubt that future developments in computer vision will improve upon our results, and we hope that researchers will consider user interfaces such as ours to be an important new motivation for such algorithms.

In conclusion, we believe interactive video object manipulation can become an important tool for augmenting video as an informational and interactive medium, and we hope this research has advanced us several steps closer to that goal.

Acknowledgments
The authors thank Peter Sand for the use of his source code, Sameer Agarwal and Josef Sivic for discussions about motion grouping, Nick Irving at UW Intercollegiate Athletics for sample video footage, and Harshdeep Singh, Samreen Dhillon, Kevin Chiu, Mira Dontcheva and Sameer Agarwal for additional technical assistance. Special thanks to Jon Goldman for the use of footage from his short film Kind of a Blur [9]. Amy Helmuth, Jay Howard, and Lauren Mayo appear in our video. Funding for this research was provided by NSF grant EIA-0321235, the University of Washington Animation Research Labs, the Washington Research Foundation, Adobe, and Microsoft.

References

1. A. Agarwala, M. Dontcheva, M. Agrawala, S. Drucker, A. Colburn, B. Curless, D. Salesin, and M. Cohen. Interactive digital photomontage. ACM Trans. Graph. (Proc. SIGGRAPH), 23(4):294–301, 2004.

2. A. Agarwala, A. Hertzmann, D. H. Salesin, and S. M. Seitz. Keyframe-based tracking for rotoscoping and animation. ACM Trans. Graph. (Proc. SIGGRAPH), 23(3):584–591, 2004.

3. ASTAR Learning Systems. http://www.astarls.com, 2006. [Online; accessed 5-January-2008].

4. C. Bregler, M. Covell, and M. Slaney. Video rewrite: Driving visual speech with audio. In Proc. SIGGRAPH 97, pages 353–360, 1997.

5. Y.-Y. Chuang, A. Agarwala, B. Curless, D. H. Salesin, and R. Szeliski. Video matting of complex scenes. ACM Trans. Graph., 21(3):243–248, 2002.

6. J. Dakss, S. Agamanolis, E. Chalom, and V. M. Bove Jr. Hyperlinked video. In Proc. SPIE, volume 3528, pages 2–10, 1999.

7. P. Dragicevic, G. Ramos, J. Bibliowicz, D. Nowrouzezahrai, R. Balakrishnan, and K. Singh. Video browsing by direct manipulation. In CHI, pages 237–246, 2008.

8. D. B Goldman, B. Curless, S. M. Seitz, and D. Salesin. Schematic storyboarding for video visualization and editing. ACM Trans. Graph. (Proc. SIGGRAPH), 25(3):862–871, 2006.

9. J. Goldman. Kind of a Blur. http://phobos.apple.com/WebObjects/MZStore.woa/wa/viewMovie?id=197994758, 2005. [Short film available online; accessed 5-January-2008].

10. R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision, page 109. Cambridge University Press, ISBN: 0521540518, second edition, 2004.

11. T. Karrer, M. Weiss, E. Lee, and J. Borchers. DRAGON: A direct manipulation interface for frame-accurate in-scene video navigation. In CHI, pages 247–250, 2008.

12. D. Kimber, T. Dunnigan, A. Girgensohn, F. Shipman, T. Turner, and T. Yang. Trailblazing: Video playback control by direct object manipulation. In ICME, pages 1015–1018, 2007.

13. D. Kurlander, T. Skelly, and D. Salesin. Comic chat. In SIGGRAPH '96, pages 225–236, 1996.

14. Y. Li, J. Sun, and H.-Y. Shum. Video object cut and paste. ACM Trans. Graph. (Proc. SIGGRAPH), 24(3):595–600, 2005.

15. C. Morningstar and R. F. Farmer. The lessons of Lucasfilm's Habitat. In M. Benedikt, editor, Cyberspace: First Steps, pages 273–301. MIT Press, Cambridge, MA, 1991.

16. Y. Pritch, A. Rav-Acha, A. Gutman, and S. Peleg. Webcam synopsis: Peeking around the world. In Proc. ICCV, pages 1–8, 2007.

17. Y. Pritch, A. Rav-Acha, and S. Peleg. Non-chronological video synopsis and indexing. IEEE Trans. PAMI, 2008. (to appear).

18. G. Ramos and R. Balakrishnan. Fluid interaction techniques for the control and annotation of digital video. In Proc. UIST '03, pages 105–114, 2003.

19. A. Rav-Acha, Y. Pritch, D. Lischinski, and S. Peleg. Dynamosaicing: Video mosaics with non-chronological time. In Proc. CVPR, pages 58–65, 2005.

20. A. Rav-Acha, Y. Pritch, and S. Peleg. Making a long video short: Dynamic video synopsis. In Proc. CVPR, pages 435–441, 2006.

21. E. Rosten, G. Reitmayr, and T. Drummond. Real-time video annotations for augmented reality. In Proc. Intl. Symp. on Visual Computing, 2005.

22. P. Sand. Long-Range Video Motion Estimation using Point Trajectories. PhD thesis, MIT, 2006.

23. P. Sand and S. Teller. Particle video: Long-range motion estimation using point trajectories. In Proc. CVPR '06, pages 2195–2202, 2006.

24. A. Schodl and I. A. Essa. Controlled animation of video sprites. In Proc. ACM/Eurographics Symp. on Comp. Animation, pages 121–127, 2002.

25. A. Schodl, R. Szeliski, D. H. Salesin, and I. Essa. Video textures. In SIGGRAPH '00, pages 489–498, 2000.

26. J. Sivic, F. Schaffalitzky, and A. Zisserman. Object level grouping for video shots. Intl. J. of Comp. Vis., 67(2):189–210, 2006.

27. J. M. Smith, D. Stotts, and S.-U. Kum. An orthogonal taxonomy for hyperlink anchor generation in video streams using OvalTine. In Proc. ACM Conf. on Hypertext and Hypermedia, pages 11–18, 2000.

28. Sportvision. Changing The Game. http://www.sportvision.com, 2006. [Online; accessed 5-January-2008].

29. J. Wang, P. Bhat, R. A. Colburn, M. Agrawala, and M. F. Cohen. Interactive video cutout. ACM Trans. Graph. (Proc. SIGGRAPH), 24(3):585–594, 2005.

30. Wikipedia. Telestrator. http://en.wikipedia.org/w/index.php?title=Telestrator&oldid=180785495, 2006. [Online; accessed 5-January-2008].

31. A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing Surveys, 38(4):13, December 2006.

32. L. Zhang, N. Snavely, B. Curless, and S. Seitz. Spacetime faces: high resolution capture for modeling and animation. ACM Trans. Graph. (Proc. SIGGRAPH), 23(3):548–558, 2004.