
UNIVERSITY OF WALES SWANSEA

REPORT SERIES

Visual Signatures in Video Visualization

by

Min Chen, Ralf P. Botchen, Rudy R. Hashim, Daniel Weiskopf, Thomas Ertl and Ian Thornton

Report # CSR 19-2005

Visual Signatures in Video Visualization

Min Chen, Ralf P. Botchen, Rudy R. Hashim, Daniel Weiskopf, Member, IEEE, Thomas Ertl, Member, IEEE, and Ian Thornton

Abstract— Video visualization is a computation process that extracts meaningful information from original video data sets and conveys the extracted information to users in appropriate visual representations. This paper presents a broad treatment of the subject, following a typical research pipeline involving concept formulation, system development, a path-finding user study and a field trial with real application data. In particular, we have conducted a fundamental study on the visualization of motion events in videos. We have, for the first time, deployed flow visualization techniques in video visualization. We have compared the effectiveness of several different abstract visual representations of videos. We have conducted a user study to examine whether users are able to learn to recognize visual signatures of motions, and to assist in the evaluation of different visualization techniques. We have applied our understanding and the developed techniques to a set of application video clips. Our study has demonstrated that video visualization is both technically feasible and cost-effective. It has provided the first set of evidence confirming that ordinary users can become accustomed to the visual features depicted in video visualizations, and can learn to recognize visual signatures of a variety of motion events.

Index Terms— video visualization, volume visualization, flow visualization, human factors, user study, visual signatures, video processing, optical flow, GPU rendering.

I. INTRODUCTION

A video is a piece of ordered sequential data, and viewing videos is a time-consuming and resource-consuming process. Video visualization is a computation process that extracts meaningful information from original video data sets and conveys the extracted information to users in appropriate visual representations. The ultimate challenge of video visualization is to provide users with a means to obtain a sufficient amount of meaningful information from one or a few static visualizations of a video using O(1) amount of time, instead of viewing the video using O(n) amount of time, where n is the length of the video. In other words, can we see time without using time?

M. Chen and R. R. Hashim are with the Department of Computer Science, University of Wales Swansea, Singleton Park, Swansea SA2 8PP, UK. Email: {m.chen,csrudy}@swansea.ac.uk.

R. P. Botchen and T. Ertl are with the Visualization and Interactive Systems Group, University of Stuttgart, Universitaetsstr. 38, D-70569 Stuttgart, Germany. Email: {botchen,thomas.ertl}@vis.uni-stuttgart.de.

D. Weiskopf is with the School of Computer Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada. Email: [email protected].

I. M. Thornton is with the Department of Psychology, University of Wales Swansea, Singleton Park, Swansea SA2 8PP, UK. Email: [email protected].

In many ways, video visualization fits the classical definition of visual analytics [27]. For example, conceptually it bears some resemblance to notions in mathematical analysis. It can be considered as an 'integral' function for summarizing a sequence of images into a single image. It can be considered as a 'differential' function for conveying the changes in a video. It can be considered as an 'analytic' function for computing temporal information geometrically.

Video data is a type of 3D volume data. Similar to the visualization of 3D data sets in many applications, one can construct a visual representation by selectively extracting meaningful information from a video volume and projecting it onto a 2D view plane. However, in many traditional applications (e.g., medical visualization), the users are normally familiar with the 3D objects (e.g., bones or organs) depicted in a visual representation. In contrast with such applications, human observers are not familiar with the 3D geometrical objects depicted in a visual representation of a video because one spatial dimension of these objects is in fact the temporal dimension. The problem is further complicated by the fact that, in most videos, each 2D frame is the projective view of a 3D scene. Hence, a visual representation of a video on a computer display is, in effect, a 2D projective view of a 4D spatiotemporal domain.

Depicting temporal information in a spatial geometric form (e.g., a graph showing the weight change of a person over a period) is an abstract visual representation of a temporal function. We therefore call the projective view of a video volume an abstract visual representation of a video, which is also a temporal function. Considering that the effectiveness of abstract representations, such as a temporal weight graph, has been shown in many applications, it is more than instinctively plausible to explore the usefulness of video visualization, for which Daniel and Chen proposed the following three hypotheses [6]:

1) Video visualization is an (i) intuitive and (ii) cost-effective means of processing large volumes of video data.

2) Well-constructed visualizations of a video are able to show information that numerical and statistical indicators (and their conventional diagrammatic illustrations) cannot.

3) Users can become accustomed to visual features depicted in video visualizations, or can be trained to recognize specific features.

The main aim of this work is to evaluate these three hypotheses, with a focus on the visualization of motion events in videos. Our contributions include:

• We have, for the first time, considered video visualization as a flow visualization problem, in addition to a volume visualization problem. We have developed a technical framework for constructing scalar and vector fields from a video, and for synthesizing abstract visual representations using both volume and flow visualization techniques.

• We have introduced the notion of a visual signature for symbolizing abstract visual features that depict individual objects and motion events. We have focused our algorithmic development and user study on the effectiveness of conveying and recognizing visual signatures of motion events in videos.

• We have compared the effectiveness of several different abstract visual representations of motion events, including solid and boundary representations of extracted objects, difference volumes, and motion flows depicted using glyphs and streamlines.

• We have conducted a user study, resulting in the first set of evidence supporting hypothesis (3). In addition, the study has provided an interesting collection of findings that can help us understand the process of visualizing motion events through their abstract visual representations.

• We have applied our understanding and the developed techniques to a set of real videos collected as benchmarking problems in a recent computer vision project [10]. This has provided further evidence to support hypotheses (1) and (2).

II. RELATED WORK

Although video visualization was first introduced as a new technique and application of volume visualization [6], it in fact reaches out to a number of other disciplines. The work presented in this paper relates to video processing, volume visualization, flow visualization and human factors in motion perception.

Automatic video processing is a research area residing between two closely related disciplines, image processing and computer vision. Many researchers have studied video processing in the context of video surveillance (e.g., [4], [5]) and video segmentation (e.g., [19], [25]). While such research and development is no doubt hugely important to many applications, the existing techniques for automatic video processing are normally application-specific, and are generally difficult to adapt to different situations without costly calibration.

The work presented in this paper takes a different approach from automatic video processing. As outlined in [27], it is intended to 'take advantage of the human eye's broad bandwidth pathway into the mind to allow users to see, explore, and understand large amounts of information at once', and to 'convert conflicting and dynamic data in ways that support visualization and analysis'.

A number of researchers have noticed the structural similarity between video data and volume data commonly seen in medical imaging and scientific computation, and have explored the avenue of applying volume rendering techniques to solid video volumes in the context of visual arts [12], [16]. Daniel and Chen [6] approached the problem from the perspective of visualization, and demonstrated that video visualization is potentially an intuitive and cost-effective means of processing large volumes of video data.

Flow visualization methods have a long tradition in scientific visualization [17], [21], [31]. There exist several different strategies to display the vector field associated with a flow. One approach used in this paper relies on glyphs to show the direction of a vector field at a collection of sample positions. Typically, arrows are employed to encode direction visually, leading to hedgehog visualizations [15], [8]. Another approach is based on characteristic lines, such as streamlines, obtained by particle tracing. A major problem of 3D flow visualization is the potential loss of visual information due to mutual occlusion. This problem can be addressed by improving the perception of streamline structures [14] or by appropriate seeding [11].

In humans, just as in machines, visual information is processed by capacity- and resource-limited systems. Limitations exist both in space (i.e., the number of items to which we can attend) [22], [28] and in time (i.e., how quickly we can disengage from one item to process another) [20], [23]. Several recent lines of research have shown that in dealing with complex dynamic stimuli these limitations can be particularly problematic [3].

For example, the phenomena of change blindness [24] and inattentional blindness [18] both show that relatively large visual events can go completely unreported if attention is misdirected or overloaded. In any application where multiple sources of information must be monitored or arrays of complex displays interpreted, the additional load associated with motion or change (i.e., the need to integrate information over time) could greatly increase overall task difficulty. Visualization techniques that can reduce temporal load clearly have important human factors implications.

III. CONCEPTS AND DEFINITIONS

A video, V, is an ordered set of 2D image frames, {I1, I2, . . . , In}. It is a 3D spatiotemporal data set, usually resulting from a discrete sampling process such as filming and animation. The main perceptive difference between viewing a still image and a video is that we are able to observe objects in motion (and stationary objects) in a video. For the purpose of maintaining the generality of our formal definitions, we include motionlessness as a type of motion in the following discussions.

Let m be a spatiotemporal entity, which is an abstract structure of an object in motion, and encompasses the changes of a variety of attributes of the object including its shape, intensity, color, texture, position in each image, and relationship with other objects. Hence the ideal abstraction of a video V is to transform it to a collection of representations of such entities {m1, m2, . . .}.

Video visualization is thereby a function, F : V → I, that maps a video V to an image I, where F is normally realized by a computational process, and the mapping involves the extraction of meaningful information from V and the creation of a visualization image I as an abstract visual representation of V. The ultimate scientific aim of video visualization is to find functions that can create effective visualization images, from which users can recognize different spatiotemporal entities {m1, m2, . . .} 'at once'.

A visual signature, V(m), is a group of abstract visual features related to a spatiotemporal entity m in a visualization image I, such that users can identify the object, the motion or both by recognizing V(m) in I. In many ways, it is notionally similar to a handwritten signature or a signature tune in music. It may not necessarily be unique and it may appear in different forms and different contexts. Its recognition depends on the quality of the signature as well as the user's knowledge and experience.

IV. TYPES OF VISUAL SIGNATURES

Given a spatiotemporal entity m (i.e., an object in motion), we can construct different visual signatures to highlight different attributes of m. As mentioned in Section III, m encompasses the changes of a variety of attributes of the object. In this work, we focus on the following time-varying attributes:

• the shape of the object,

• the position of the object,

• the object appearance (e.g., intensity and texture),

• the velocity of the motion.

(a) five frames (No. 0, 5, 10, 15, 20) selected from a video

(b) a visual hull (c) an object boundary

(d) a difference image (e) a motion flow

Fig. 1. Selected frames of a simple up-and-down motion, depicting the first of the five cycles of the motion, together with examples of its attributes associated with frame 1.

Consider a 101-frame animation video of a simple object in a relatively simple motion. As shown in Figure 1(a), the main spatiotemporal entity contained in the video is a textured sphere moving upwards and downwards in a periodic manner.

To obtain the time-varying attributes about the shape and position of the object concerned, we can extract the visual hull of the object in each frame from the background scene. We can also identify the boundary of the visual hull, which to a certain extent conveys the relationship between the object and its surroundings (in this simple case, only the background). Figures 1(b) and (c) show the solid and boundary representations of a visual hull.

To characterize the changes of the object appearance, we can compute the difference between two consecutive frames; Figure 1(d) gives an example difference image. We can also establish a 2D motion field to describe the movement of the object between each pair of consecutive frames, as shown in Figure 1(e). There is a very large collection of algorithms for obtaining such attributes in the literature, and we describe our algorithmic implementation in Section VI.

Compiling all visual hull images into a single volume results in a 3D scalar field, which we call the extracted object volume. Similarly, we obtain an object boundary volume and a difference volume, which are also in the form of 3D scalar fields. The compilation of all 2D motion fields into a single volumetric structure gives us a motion flow in the form of a 3D vector field. Given these attribute fields of the spatiotemporal entity m, we can now consider the creation of different visual signatures for m.
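To make the data layout concrete, the following C++ sketch shows how per-frame attribute images might be compiled into such a 3D scalar field. It is illustrative only: a plain thresholded absolute difference stands in for the change detection filter actually used (see Section VI), and the names ScalarVolume, extractObject and stackFrames are our own.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// A 3D scalar field built from n frames of size w x h;
// voxel (x, y, k) lives at data[k*h*w + y*w + x].
struct ScalarVolume {
    int w = 0, h = 0, n = 0;
    std::vector<float> data;
};

// Hypothetical stand-in for the change detection filter: mark a pixel as
// belonging to the visual hull if it differs enough from the reference
// (empty-scene) frame. The paper's LDD filter is more robust than this.
std::vector<float> extractObject(const std::vector<std::uint8_t>& frame,
                                 const std::vector<std::uint8_t>& reference,
                                 int threshold = 30)
{
    std::vector<float> hull(frame.size());
    for (std::size_t i = 0; i < frame.size(); ++i) {
        int d = std::abs(int(frame[i]) - int(reference[i]));
        hull[i] = (d > threshold) ? 1.0f : 0.0f;
    }
    return hull;
}

// Compile the per-frame images {O_1, ..., O_n} into a single scalar volume.
ScalarVolume stackFrames(const std::vector<std::vector<float>>& frames, int w, int h)
{
    ScalarVolume vol;
    vol.w = w; vol.h = h; vol.n = static_cast<int>(frames.size());
    vol.data.reserve(static_cast<std::size_t>(w) * h * frames.size());
    for (const auto& f : frames)
        vol.data.insert(vol.data.end(), f.begin(), f.end());
    return vol;
}
```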

One can find numerous ways to visualize such scalar and vector fields individually or in a combined manner. Without over-complicating the user study to be discussed in Section V, we selected four types of visualization for representing visual signatures. Each type of visual signature highlights certain attributes of the object in motion, and reflects a strength of a particular volume or flow visualization technique. All four types of visualization can be synthesized in real time, for which we will give further technical details in Section VI.

We also chose the horseshoe view [6] as the primary view representation. Although it has an obvious shortcoming (i.e., the x-axis on the right part of the visualization points in the opposite direction of that on the left), we cannot resist its enticing merits:

• It maximizes the use of a rectangular display area typically available in a user interface.

• It places four faces of a volume, including the starting and finishing frames, in a front view.

• It conveys the temporal dimension differently from the two spatial dimensions.

Since our user study (Section V) did not require the human observer to identify a specific direction from a visual signature, on the whole the horseshoe view served the objectives of the user study better than other view presentations.

A. Type A: Temporal Visual Hull

This type of visual signature displays a projective view of the temporal visual hull of the object in motion. Steady features, such as the background, are filtered away. Figure 2(a) shows a horseshoe view of the extracted object volume for the video in Figure 1. The temporal visual hull, which is displayed as an opaque object, can be seen wiggling up and down in a periodic manner.

(a) Type A: temporal visual hull

(b) Type B: 4-band difference volume

(c) Type C: motion flow with glyphs

(d) Type D: motion flow with streamlines

Fig. 2. Four types of visual signatures of an up-and-down periodic motion given in Figure 1.

B. Type B: 4-Band Difference Volume

Difference volumes played an important role in [6], where amorphous visual features rendered using volume raycasting successfully depicted some motion events in their application examples. However, their use of transfer functions encoded very limited semantic meaning. For this work, we designed a special transfer function that highlights the motion and the change of a visual hull, while using a relatively smaller amount of bandwidth to convey the change of object appearance (i.e., intensity and texture).

Consider two example frames and their corresponding visual hulls, Oa and Ob, in Figures 3(a) and (b). We classify the pixels in the difference volume to be computed into four groups, as shown in Figure 3(c), namely (i) background (∉ Oa ∧ ∉ Ob), (ii) new pixels (∉ Oa ∧ ∈ Ob), (iii) disappearing pixels (∈ Oa ∧ ∉ Ob), and (iv) overlapping pixels (∈ Oa ∧ ∈ Ob). The actual difference value of each pixel, which typically results from a change detection filter, is mapped to one of four bands according to the group that the pixel belongs to. This enables the design of a transfer function that encodes some semantics in relation to the motion and geometric change.

For example, Figure 2(b) was rendered using the transfer function illustrated in Figure 3(d), which highlights new pixels in nearly-opaque red and disappearing pixels in nearly-opaque blue, while displaying overlapping pixels in translucent grey and leaving background pixels totally transparent. Such a visual signature gives a clear impression that the object is in motion, and to a certain degree, provides some visual cues to velocity.
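As a concrete illustration of this classification, the following C++ sketch (our own naming; the RGBA values merely mimic the mapping described above) classifies a pixel from its membership in Oa and Ob, packs the raw difference value into one of four value bands, and applies a band-aware transfer function.

```cpp
#include <algorithm>
#include <array>

enum Band : int { Background = 0, NewPixel = 1, Disappearing = 2, Overlapping = 3 };

// Classify one pixel from its membership in the visual hulls O_a and O_b.
Band classify(bool inOa, bool inOb)
{
    if (!inOa && !inOb) return Background;    // outside the object in both frames
    if (!inOa &&  inOb) return NewPixel;      // appears in the later frame
    if ( inOa && !inOb) return Disappearing;  // vanishes in the later frame
    return Overlapping;                       // covered by the object in both frames
}

// Pack a raw difference value d in [0,1] into one of four value bands, so that
// a single 1D transfer function can encode the semantics of each band.
float bandedValue(float d, Band b) { return (static_cast<float>(b) + d) / 4.0f; }

// Band-aware transfer function (RGBA in [0,1]): background transparent,
// new pixels nearly-opaque red, disappearing pixels nearly-opaque blue,
// overlapping pixels translucent grey.
std::array<float, 4> transferFunction(float bandedVal)
{
    int band = std::min(static_cast<int>(bandedVal * 4.0f), 3);
    switch (band) {
        case NewPixel:     return {1.0f, 0.1f, 0.1f, 0.9f};
        case Disappearing: return {0.1f, 0.1f, 1.0f, 0.9f};
        case Overlapping:  return {0.6f, 0.6f, 0.6f, 0.15f};
        default:           return {0.0f, 0.0f, 0.0f, 0.0f};
    }
}
```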

(a) frames Ia and Ib (b) visual hulls Oa and Ob

(c) 4 semantic bands (d) color mapping

Fig. 3. Two example frames and their corresponding visual hulls. Four semantic bands can be determined using Oa and Ob, and an appropriate transfer function can encode more semantic meaning according to the bands.

C. Type C: Motion Flow with Glyphs

In many video-related applications, the recognition of motion is more important than that of an object. Hence it is beneficial to enhance the perception of motion by visualizing the motion flow field associated with a video. This type of visual signature combines the boundary representation of a temporal visual hull with arrow glyphs showing the direction of motion at individual volumetric positions. It is necessary to determine an appropriate density of arrows, as too many would clutter a visual signature, while too few would lead to substantial information loss. We therefore used a combination of parameters to control the density of arrows, which will be detailed in Section VI. Figure 2(c) shows a Type C visual signature of a sphere in an up-and-down motion. In this particular visualization, the colors of arrows were chosen randomly to enhance the depth cue of partially occluded arrows.

Note that there is a major difference between the motion flow field of a video and the typical 3D vector fields considered in flow visualization. In a motion flow field, each vector has two spatial components and one temporal component. The temporal component is normally set to a constant for all vectors. We experimented with a range of different constants for the temporal component, and found that a non-zero constant would confuse the visual perception of the two spatial components of the vector. We therefore chose to set the temporal components of all vectors to zero.
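A minimal sketch of this construction is given below, assuming that each per-frame flow field is stored row-major as separate u and v component arrays; the names and the flat memory layout are our own assumptions.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, t; };   // two spatial components plus a temporal one

// A 3D vector field over the video volume;
// the vector at (x, y, k) lives at data[k*h*w + y*w + x].
struct FlowVolume {
    int w = 0, h = 0, n = 0;
    std::vector<Vec3> data;
};

// Assemble per-frame 2D optical flow fields into a single motion flow field,
// forcing the temporal component of every vector to zero as discussed above.
FlowVolume buildMotionFlowField(const std::vector<std::vector<float>>& u,
                                const std::vector<std::vector<float>>& v,
                                int w, int h)
{
    FlowVolume vol;
    vol.w = w; vol.h = h; vol.n = static_cast<int>(u.size());
    vol.data.reserve(static_cast<std::size_t>(w) * h * u.size());
    for (std::size_t k = 0; k < u.size(); ++k)
        for (int i = 0; i < w * h; ++i)
            vol.data.push_back({u[k][i], v[k][i], 0.0f});
    return vol;
}
```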

D. Type D: Motion Flow with Streamlines

The visibility of arrow glyphs requires them to be displayed at a certain minimum size, which often leads to the problem of occlusion. One alternative approach is to use streamlines to depict the direction of motion flow. However, because all temporal components in the motion flow field are equal to zero, each streamline can only flow within the x-y plane where the corresponding seed resides, and it seldom flows far. Hence there is often a dense cluster of short streamlines, making it difficult to use color for direction indication.

To improve the sense of motion and the perception of direction, we mapped a zebra-like dichromic texture to the line geometry, which moves along the line in the direction of the flow. Although this can no longer be considered strictly as a static visualization, it is not in any way trying to recreate an animation of the original video. The dynamics introduced consist of a fixed number of steps, which are independent of the length of a video. The time requirement for viewing such a visualization remains O(1). Figure 2(d) shows a static view of such a visual signature. The perception of this type of visual signature normally improves when the size and resolution of the visualization increase.
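Such a moving zebra pattern can be realized with a simple phase offset that cycles over a fixed number of steps, independent of the video length. The following sketch is hypothetical (stripe length, step count and naming are our own choices):

```cpp
// Returns whether the point at arc length s along a streamline falls on a
// light or a dark stripe at animation step 'step'. The phase cycles over a
// fixed number of steps, so the animation length is independent of the video.
bool onLightStripe(float s, int step, float stripeLength = 4.0f, int numSteps = 16)
{
    float phase = stripeLength * 2.0f * static_cast<float>(step % numSteps) / numSteps;
    int stripeIndex = static_cast<int>((s + phase) / stripeLength);
    return stripeIndex % 2 == 0;
}
```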

V. A USER STUDY ON VISUAL SIGNATURES

The discussions in the previous sections naturally lead to many scientific questions concerning visual signatures. The following are just a few examples:

• Can users distinguish different types of spatiotemporal entities (i.e., types of objects and types of motion, individually and in combination) from their visual signatures?

• If the answer to the above is yes, how easy is it for an ordinary user to acquire such an ability?

• What kind of attributes are suitable to be featured or highlighted in visual signatures?

• What is the most effective design of a visual signature, and in what circumstances?

• What kind of visualization techniques can be used for synthesizing effective visual signatures?

• How would the variations of camera attributes, such as position and field of view, affect visual signatures?

• How would the recognition of visual signatures scale in proportion to the number of spatiotemporal entities present?

Almost all of these questions are related to the human factors in visualization and motion perception. There is no doubt that user studies must play a part in our search for answers to these questions. As an integral part of this work, we conducted a user study on visual signatures.

Because this is the first user study on visual signatures of objects in motion, we decided to focus our study on the recognition of types of motion. We therefore set the main objectives of this user study as:

1) to evaluate the hypothesis that users can learn to recognize motions from their visual signatures.

2) to obtain a set of data that measures the difficulties and time requirements of a learning process.

3) to evaluate the effectiveness of the above-mentioned four types of visual signatures.

A. Types of Motion

As mentioned before, an abstract visual representation of a video is essentially a 2D projective view of our 4D spatiotemporal world. Visual signatures of spatiotemporal entities in real-life videos can be influenced by numerous factors and appear in various forms. In order to meet the key objectives of the user study, it was necessary to reduce the number of variables in this scientific process. We hence used simulated motions with the following constraints:

• All videos feature only one sphere object in motion. The use of a sphere minimizes the variations of visual signatures due to camera positions and perspective projection.

• In each motion, the centre of the sphere remains in the same x-y plane, which minimizes the ambiguity caused by the change of object size due to perspective projection.

• Since the motion function is known, we computed most attribute fields analytically. This is similar to an assumption that the sphere is perfectly textured and lit and without shadows, which would minimize the errors in extracting various attribute fields using change detection and motion estimation algorithms.

We consider the following seven types of motion:

1) Motion Case 1: No motion — in which the sphere remains in the centre of the image frame throughout the video.

2) Motion Cases 2-9: Scaling — in which the radius of the sphere increases by 100%, 75%, 50% and 25%, and decreases by 25%, 50%, 75% and 100%, respectively.

3) Motion Cases 10-25: Translation — in which the sphere moves in a straight line in eight different directions (i.e., 0°, 45°, 90°, . . . , 315°) and at two different speeds.

4) Motion Cases 26-34: Spinning — in which the sphere rotates about the x-axis, y-axis and z-axis, without moving its centre, with 1, 5 and 9 revolutions, respectively.

5) Motion Cases 35, 38, 41: Periodic up-and-down translation — in which the sphere moves upwards and downwards periodically at three different frequencies, namely 1, 5 and 9 cycles.

6) Motion Cases 36, 39, 42: Periodic left-and-right translation — in which the sphere moves towards the left and right periodically at three different frequencies, namely 1, 5 and 9 cycles.

7) Motion Cases 37, 40, 43: Periodic rotation — in which the sphere rotates about the centre of the image frame periodically at three different frequencies, namely 1, 5 and 9 cycles.

The first four types are considered to be elementary motions. The last three are composite motions, which can be decomposed into a series of simple translation motions in smaller time windows. Figure 4 shows the four different visual signatures for each of the five example cases.

(a) Motion Case 1: the sphere is stationary

(b) Motion Case 2: the radius of the sphere increases by 100%

(c) Motion Case 25: the sphere moves towards the upper-right corner of the image frame

(d) Motion Case 31: the sphere spins about the z-axis without moving its center

(e) Motion Case 40: the sphere makes 5 rotations about the centre of the image frame

Fig. 4. Four types of visual signatures for five example motion cases. From left to right: Type A (temporal visual hull), Type B (4-band difference volume), Type C (motion flow with glyphs), and Type D (motion flow with streamlines).

We did consider including other composite motions, such as periodic scaling, and combined scaling, translation and spinning, but decided to limit the total number of cases in order to obtain an adequate number of samples for each case while controlling the time spent by each observer in the study. We also made a conscious decision not to include complex motions such as deformation, shearing and fold-over in this user study.

B. Method

1) Participants: Sixty-nine observers (23 Female, 46 Male) from the University of Wales Swansea student community took part in this study. All observers had normal or corrected-to-normal vision and were given a £2 book voucher each as a small thank-you gesture for their participation. Data from two participants were excluded from analysis as their response times were more than 3 standard deviations from the mean. Thus, data from 67 (22 Female, 45 Male) observers were analyzed.

2) Task: The user study was conducted in 14 sessions over a three-week period. Each session, which involved about 4 or 5 observers, started with a 25-minute oral presentation, given by one of the co-authors of this paper, with the aid of a set of pre-written slides. The presentation was followed by a test, typically taking about 20 minutes to complete. A piece of interactive software was specially written for structuring the test as well as collecting the results.

The presentation provided an overview of the scientific background and objectives of this user study, and gave a brief introduction to the four types of visual signatures, largely in the terminology of a layperson. It outlined the steps of the test, and highlighted some potential difficulties and misunderstandings. As part of a learning process, a total of 10 motions and 11 visual signatures were shown as examples in the slides.

The test was composed of 24 experiment trials. On each trial, the observer was presented with between 1 and 4 visual signatures of a motion. The task was simply to identify the underlying motion pattern by selecting from the 4 alternatives listed at the bottom of the screen. This selection was made by clicking on the appropriate radio button. Both the speed and the accuracy of this response were measured. As observers were allowed to correct initial responses, the final reaction time was taken from the point when they proceeded to the next part of the trial.

The second part of the trial was designed to provide feedback and training for the observers to increase the likelihood of learning. It also provided a measure of subjective utility, that is, how useful observers found each type of visual signature. In this part, the underlying motion clip was shown in full together with all four types of visual signatures. The task was simply to indicate which of the four visual signatures appeared to provide the most relevant information. This was not a speeded response and no response time was measured.

At the end of the experiment, observers were also asked to provide an overall usefulness rating for each type of visual signature. A rating scale from 1 (least) to 5 (most) effective was used.

3) Design: The 24 trials in each test were blocked into 4 equal learning phases (6 trials per phase) in which the amount of available information was varied. In the initial phase, all 4 visual signatures were presented, providing the full range of information. In each successive phase, the number of available representations was reduced by one, so that in the final phase only one visual signature was provided. This fixed order was imposed so that observers would receive sufficient training before being presented with minimum information. For each observer, a random subset of the 43 motion cases was selected and randomly assigned to the 24 experimental trials. For each case, the 4 possible options were fixed. The position of the options was, however, randomized on the screen on an observer-by-observer basis to minimize simple response strategies.

C. Results and Remarks

The user study provided us with a large volume of interesting data, from which we have gained some useful insight into the likely correlations among types of motion, types of visual signatures, types of users, and the process of learning. Here we provide some summary data supporting our major findings. We will briefly list some of our other observations at the end of this section.

Table I gives the mean accuracy (in percentage) and response time (in seconds) in relation to motion types.

We can clearly observe that for some types of motion, such as scaling and motionlessness, the success rate for positive identification of a motion from its visual signatures is relatively high. Meanwhile, the response time also indicates that these two types of motion are relatively easy to recognize. Note that we only include the response times of successful trials in calculating the mean response time.

Spinning motion appears to be the most difficult to recognize. This is because the projection of the sphere in motion maintains the same outline and position throughout the motion. As we can see from Figure 4, the temporal visual hull of Motion Case 31, which is a spinning motion, is identical to that of Motion Case 1, which is motionless. This renders the Type A visual signature totally ineffective in differentiating any spinning motion from motionlessness.

For the translational motions, including elementary motion in one direction and composite motion with periodic changes of direction, it generally took time for the observers to work out the motion from their visual signatures.

TABLE I
MEAN ACCURACY AND RESPONSE TIME IN RELATION TO MOTION TYPES.

              Accuracy   Response time (in seconds)
Static        87.5%      19.9
Scaling       89.3%      14.0
Translation   67.9%      22.6
Spinning      50.6%      24.0
Periodic      63.0%      24.7

Table II gives the mean accuracy (in percentage) and response time (in seconds) in each of the four phases. From the data, we can observe an unmistakable decrease of response time following the progress of the test. However, the improvement of the success rate is less dramatic, and in fact the trend reversed in Phase 4. This was in fact not a surprise at all. Because the design of the user study featured the reduction of the number of visual signatures from initially four in Phase 1 to finally one in Phase 4, the tasks became harder when progressing from one phase to another. There was a steady improvement of accuracy from Phase 1 to Phase 3, indicating that the observers were learning from their successes and mistakes. The results have also clearly indicated that when only one visual signature was available, the observers became less effective, and appeared to have lost some of the 'competence' gained in the first three phases.

One likely reason is that a single visual signature is often ambiguous. The fact that Motion Cases 1 and 31 in Figure 4 share the same Type A visual signature is one example of many such ambiguous situations. Another possible reason is that the lack of a confirmation process based on a second visual signature may have resulted in more errors in the decision process.

TABLE II
MEAN ACCURACY AND RESPONSE TIME IN EACH PHASE.

          Accuracy   Response time (in seconds)
Phase 1   66.7%      30.8
Phase 2   70.1%      22.2
Phase 3   72.1%      17.5
Phase 4   62.9%      13.4

Figure 5 shows the accuracy in relation to each type of motion in each phase. We can observe that the spinning motion seemed to benefit more from having multiple visual signatures available at the same time. The noticeable decrease in the number of positive identifications of the motionless event in Phase 3 may also be caused by the difficulties in differentiating it from spinning motions. Figure 6 shows a consistent reduction of response time for all types of motion.

Figure 7 summarizes the preference of observers in terms of types of visual signatures. For each type of motion, the preference shown by the observers largely reflects the effectiveness of each type of visual signature. Note that the number of appearances of each type of motion corresponds to the number of cases in that type of motion. For example, as there is only one motionless case, the total number of 'votes' for the 'static' category is much lower than for other types. Note also that the Type C visual signature was considered to be the most effective in relation to the spinning motion.

The overall preference (shown on the right of Figure 7) was calculated by putting all 'votes' together regardless of the type of motion involved. This corresponds reasonably well with the final scores, ranging from 1 (least) to 5 (most) effective, given by the observers at the end. The mean scores for the four types of visual signatures are A: 2.6, B: 4.0, C: 3.6 and D: 3.1, respectively.

We have also made some other observations, including the following:

• There was no consistent or noticeable difference between different genders in their success rates.

• There was a more noticeable difference between different age groups. For instance, the 'less than 20' age group did not perform well in terms of accuracy.

• The relatively lower scores received by the Type D visual signature may be partially affected by the size of the image used in the test. The display of such a visual signature on a projection screen received a more positive reaction from the observers.

• Periodic translation events with 5 and 9 cycles are easier to recognize than those with only one cycle.

• The observers were made aware of the shortcoming of the horseshoe view, and most became accustomed to the axis-flipping scenario.

Fig. 5. The mean accuracy, measured in each of the four phases, categorized by the types of motion.

Fig. 6. The mean response time, measured in each of the four phases, categorized by the types of motion.

VI. SYNTHESIZING VISUAL SIGNATURES

Figure 8 shows the overall technical pipeline implemented in this work for processing captured videos and synthesizing abstract visual representations. The main development goals for this pipeline were:

• to extract a variety of intermediate data sets that represent attribute fields of a video. Such data sets include the extracted object volume, difference volume, boundary volume and optical flow field.

• to synthesize different visual representations using volume and flow visualization techniques individually as well as in a combined manner.

• to enable real-time visualization of deformed video volumes (i.e., the horseshoe view), and to facilitate interactive specification of viewing parameters and transfer functions.

Fig. 7. The relative preference for each type of visual signature, presented in percentage terms and categorized by the types of motion. The overall preference is also given.

In addition to this pipeline, there is also a software tool for generating the simulated motion data for our user study described in Section V. In the following two subsections, we briefly describe the two groups of functional modules in the pipeline.

A. Video Processing

Given a video V stored as a sequence of images {I1, I2, . . . , In}, the goal of the video processing modules is to generate appropriate attribute fields, which include:

• Extracted Object Volume — which is stored as a sequence of images {O1, O2, . . . , On}. For each video, we manually identify a reference image, Iref, which represents an empty scene. We then use a change detection filter, Oi = Φo(Ii, Iref), i = 1, 2, . . . , n, to compute the extracted object volume. For the application case studies to be discussed in Section VII, we employed the improved version of the linear dependence detector (LDD) [9] as Φo.

• 4-Band Difference Volume — which is also stored as a sequence of images {D2, D3, . . . , Dn}. We can also use a change detection filter, Di = Φ4bd(Ii−1, Ii), i = 2, 3, . . . , n, to compute a difference volume. We designed and implemented such a filter by extending the LDD algorithm to include the 4-band classification described in Section IV-B.

• Object Boundary Volume — which is stored as a sequence of images {E1, E2, . . . , En}. This can easily be computed by applying an edge detection filter to either the images in the original volume or those in the extracted object volume. For this work, we utilized the public domain software pamedge in Linux to compute the object boundary volume as Ei = Φe(Oi), i = 1, 2, . . . , n.

• Optical Flow Field — which is stored as a sequence of 2D vector fields {F2, F3, . . . , Fn}. In video processing, it is common to compute a discrete time-varying motion field from a video by estimating the spatiotemporal changes of objects in the video. The existing methods fall into two major categories, namely object matching and optical flow [29]. The former (e.g., [1]) involves knowledge about some known objects, such as their 3D geometry or IK-skeleton, while the latter (e.g., [13], [2], [26]) is less accurate but can be applied to almost any arbitrary situation. To compute the optical flow of a video, we adopted the gradient-based differential method by Horn and Schunck [13]. The original algorithm was published in 1981, and many variations have been proposed since. Our implementation was based on the modified version by Barron et al. [2], resulting in a filter Fi = Φof(Ii−1, Ii), i = 2, 3, . . . , n.

• Seed List — which is stored as a sequence of text files {S2, S3, . . . , Sn}, each of which contains a list of seeds for the corresponding image. We implemented a seed generation filter, which computes a seed file from a vector field as Si = Φs(Fi), i = 2, 3, . . . , n. There are three parameters, which can be used in combination, for controlling the density of seeds, namely the grid interval, the minimal vector length, and the probability in random selection (a minimal sketch of such a filter is given after this list).

Fig. 8. The technical pipeline for processing video and synthesizing abstract visual representations. Data files are shown in pink, software modules in blue, and hardware-assisted modules in yellow.
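The seed generation filter Φs mentioned above can be sketched as follows. The function and parameter names are our own and the default values are arbitrary, but the three density controls follow the text: a regular grid interval, a minimal vector length, and a probability for random thinning.

```cpp
#include <cmath>
#include <cstdlib>
#include <vector>

struct Seed { int x, y; };

// Hypothetical sketch of a seed generation filter Phi_s for one frame.
// u and v hold the two components of the 2D flow field, stored row-major.
std::vector<Seed> generateSeeds(const std::vector<float>& u,
                                const std::vector<float>& v,
                                int w, int h,
                                int gridInterval = 8,         // sample on a regular grid
                                float minLength = 0.5f,       // skip near-zero vectors
                                float keepProbability = 0.5f) // random thinning
{
    std::vector<Seed> seeds;
    for (int y = 0; y < h; y += gridInterval) {
        for (int x = 0; x < w; x += gridInterval) {
            int i = y * w + x;
            float length = std::sqrt(u[i] * u[i] + v[i] * v[i]);
            if (length < minLength)
                continue;                                   // too little motion here
            if (static_cast<float>(std::rand()) / RAND_MAX > keepProbability)
                continue;                                   // thinned out at random
            seeds.push_back({x, y});
        }
    }
    return seeds;
}
```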

B. GPU Rendering

The rendering framework was implemented in C++, using Direct3D as the graphics API and HLSL as the programming language for the GPU. It realizes a publisher-subscriber notification concept with a shared state interface for data storage and interchange between all components. Following the principle of separation of concerns [7], the renderer is split into three layers:

• Application Layer — which handles user interaction and application logic.

• Data Layer — which stores all necessary data (e.g., volume data, vector fields, transfer functions, seed points, trace lines).

• Rendering Layer — which contains all algorithmic components for volume and geometry rendering.

The communication between different layers for exchanging information is one of the most important functions in this framework. We implemented this functionality through the shared state, which defines an enumerator of a set of uniquely defined state variables, storing pointers to all subscribed variables and data structures. Hence, all layers have access to necessary information through simple 'getter' and 'setter' interface functions, independent of the data type. The return value of these functions provides only a pointer to the memory address of the stored data, and thus minimizes the amount of data flow.
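The shared-state mechanism can be pictured as a small lookup table from enumerated keys to raw pointers. The sketch below is our own simplification; the key names and the get/set interface are assumptions, not the framework's actual API.

```cpp
#include <map>

// Uniquely defined state variables shared between the layers.
enum class StateKey { VolumeData, VectorField, TransferFunction, SeedPoints, TraceLines };

class SharedState {
public:
    // Publish: store only the address of the subscribed variable or data structure.
    void set(StateKey key, void* address) { table_[key] = address; }

    // Subscribe: return the stored pointer; the caller casts it to the expected type.
    template <typename T>
    T* get(StateKey key) const
    {
        auto it = table_.find(key);
        return it == table_.end() ? nullptr : static_cast<T*>(it->second);
    }

private:
    std::map<StateKey, void*> table_;
};

// Usage: the data layer publishes a volume, the rendering layer retrieves it.
//   SharedState state;
//   state.set(StateKey::VolumeData, &volumeData);
//   auto* volume = state.get<VolumeType>(StateKey::VolumeData);
```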

As our rendering framework uses programmable hardware to achieve real-time rendering, it needs to map all rendering stages to the architecture of the GPU pipeline. For volume rendering, this comprises two major steps, namely proxy geometry generation as input to GPU geometry processing, and rendering of the volume as part of GPU fragment processing. Our implementation of the texture-based horseshoe volume renderer is partly based on an existing GPU volume rendering framework [30].

Since the framework needs to support both volume and flow visualization, it is necessary to introduce the display of real geometry into the rendering pipeline. We therefore extended the pipeline to include five successive stages, which are illustrated in Figure 8 and outlined below.

• The first two modules (in yellow) generate the proxy geometry that is used to sample and reconstruct the volume on the GPU. First, the Volume Slicer module creates a set of view-aligned proxy slices according to the current view direction and sampling distance, and stores a pointer to the buffer in the shared state.

• Those slice polygons are then triangulated by the Slice Tesselator, which creates vertex and index buffers, and stores a pointer to them in the shared state, to be used later by the Horseshoe Volume Renderer.

• The Horseshoe Volume Renderer module triggers the actual volume rendering process on the hardware, passing the vertex buffer, the index buffer, and the 3D volume texture to it.

• The two interposed geometry-based modules, Horseshoe Bounding Box Renderer and Horseshoe Flow Geometry Renderer, are not part of the volume rendering process, but rather create and directly render geometry such as arrows, trace lines and the deformed bounding box.

Technically, it is more appropriate to run both geometry-based modules prior to the Horseshoe Volume Renderer module. Hence, the rendering process is actually executed as follows. The horseshoe bounding box is first drawn with full opacity, with depth-writing and depth-testing turned on. Then, the flow geometry is rendered to the frame buffer with traditional α-blending. Finally, depth-writing is turned off, and the proxy geometry is rendered, using α-blending to accumulate all individual slices into the frame buffer. The reconstructed volume blends with the geometry drawn by the two geometry-based modules. By using this order of rendering, it is ensured that the geometry, such as the bounding box, is not occluded by the volume. Furthermore, the intensity of objects placed inside the volume is attenuated more by the volume structures the farther they are behind them. This leads to improved depth perception for the users.
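The order of operations can be summarized by the sketch below. The helper functions are hypothetical stubs standing in for the corresponding graphics-API state changes and draw calls; only the ordering and the depth/blending settings reflect the text.

```cpp
// Hypothetical stubs for the underlying graphics-API calls.
void setDepthTest(bool enabled)     { /* enable or disable depth testing  */ (void)enabled; }
void setDepthWrite(bool enabled)    { /* enable or disable depth writing  */ (void)enabled; }
void setAlphaBlending(bool enabled) { /* enable or disable alpha blending */ (void)enabled; }
void drawBoundingBox()              { /* deformed horseshoe bounding box  */ }
void drawFlowGeometry()             { /* arrow glyphs and trace lines     */ }
void drawProxySlices()              { /* view-aligned, accumulated slices */ }

void renderHorseshoeFrame()
{
    setDepthTest(true);
    setDepthWrite(true);
    drawBoundingBox();        // 1. bounding box, fully opaque

    setAlphaBlending(true);
    drawFlowGeometry();       // 2. flow geometry, alpha-blended

    setDepthWrite(false);     // 3. volume must not occlude the geometry
    drawProxySlices();        //    accumulate all slices into the frame buffer
}
```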

Table III gives the performance results for the 200×200×100 data sets used in the user study. The timing is given in frames per second (fps). All timings were measured on a standard desktop PC with a 3.4 GHz Pentium 4 processor and an NVIDIA GeForce 7800 GTX based graphics board.

VII. APPLICATION CASE STUDIES

We have applied our understanding and the developed techniques to a set of video clips collected in the CAVIAR project [10] as benchmarking problems for computer vision.

TABLE III
PERFORMANCE RESULTS FOR THE CASE STUDY DATASETS.

Viewport size   800×600   1024×768   1280×1024
300 slices      25.8      16.5       9.4
600 slices      12.9      8.3        4.7

(a) a selected image frame

(b) extracted objects (c) 4-band difference

(d) a computed optical flow

Fig. 9. A selected scene from the video 'Fight OneManDown' collected by the CAVIAR project [10], and its associated attributes computed in the video processing stage.

In particular, we considered a collection of 28 video clips filmed in the entrance lobby of the INRIA Labs at Grenoble, France, from a similar camera position using a wide-angle lens. Figure 9(a) shows a typical frame of the collection, with actors highlighted in red and non-acting visitors in yellow. All videos have the same resolution of 384×288 pixels per frame and 25 frames per second. As all videos are available in compressed MPEG2, there is a noticeable amount of noise, which presents a challenge to the synthesis of meaningful visual representations for these video clips as well as to automatic object recognition in computer vision.

The video clips recorded a variety of scenarios of interest, including people walking alone and in groups, meeting with others, fighting and passing out, and leaving a package in a public place. Because of the relatively high camera position and the fact that almost all motions took place on the ground, the view of the scene exhibits some similarity to the simulated view used in our user study. It is therefore appropriate and beneficial to examine the visual signatures of the different types of motion events featured in these videos.

In this work, we tested several change detection algorithms as studied in [6], and found that the linear dependence detection algorithm [9] is the most effective for extracting an object representation. As shown in Figure 9(b), there is a significant amount of noise at the lower left part of the image, where the sharp contrast between external lighting and shadows is especially sensitive to minor camera movements and to the lossy compression algorithm used in capturing these video clips. In many video clips, there were also non-acting visitors browsing in that area, resulting in more complicated noise patterns. Using the techniques described in Section VI-A, we also computed a 4-band difference image between each pair of consecutive frames using the filter Φ4bd (Figure 9(c)), and an optical flow field using the filter Φof (Figure 9(d)).

Figure 10 shows the visualization of three video clips in the CAVIAR collection [10]. In the 'Fight OneManDown' video, two actors first walked towards each other, then fought. One actor knocked the other down, and left the scene. From both the Type B and Type C visualizations, we can identify the movements of people, including the two actors and some other non-acting visitors. We can also recognize the visual signature for the motion when one of the actors was on the floor. In Type B, this part of the track appears to be colorless, while in Type C, there is no arrow associated with the track. This hence indicates the lack of motion. In conjunction with the relative position of this part of the track, it is possible to deduce that a person is motionless on the floor.

(a) Five frames from the 'Fight OneManDown' video, together with its Type B and Type C visualization.

(b) Three frames from the 'Meet WalkTogether2' video, together with its Type C visualization. (c) Three frames from the 'Rest SlumpOnFloor' video, together with its Type C visualization.

Fig. 10. The visualizations of three video clips in the CAVIAR collection [10], which feature three different situations involving people walking, stopping and/or falling onto the floor.

We can observe a similar visual signature in part of the track in Figure 10(c). We may observe that in Figure 10(b), parts of the two tracks also appear to be motionless. However, there is no significant change in the height of the tracks. In fact, the motionlessness is due to the fact that the two actors in Figure 10(b) stopped to greet each other.

Figure 11 shows three different situations involving people leaving things around in the scene. Firstly, we can recognize the visual signature of the stationary objects brought into the scene (e.g., a bag or a box) in Figures 11(a), (b) and (c). We can also observe the differences among the three. In (a), the owner appeared to have left the scene after leaving an object (i.e., a bag) behind. Someone (in fact the owner himself) later came back to pick up the object. In (b), an object (i.e., a bag) was left only for a short period, and the owner was never far from it. In (c), the object (i.e., a box) was left in the scene for a long period, and the owner also appeared to walk away from the object in an unusual pattern.

Visual signatures of spatiotemporal entities in real-life videos can be influenced by numerous factors and appear in various forms. Such diversity does not in any way undermine the feasibility of video visualization; on the contrary, it strengthens the argument for involving the 'bandwidth' of the human eyes and intelligence in the loop. The above examples can be seen as further evidence showing the benefits of video visualization.

(a) Five frames from the ‘LeftBag’ video, together with its Type B and Type C visualization.

(b) Type C visualization for the ‘LeftBag PickedUp’ video (c) Type C visualization for the ‘LeftBox’ video

Fig. 11. The visualizations of three video clips in the CAVIAR collection [10], which feature three different situations involving people leaving things around. We purposely left out the original video frames for the 'LeftBag PickedUp' and 'LeftBox' videos.

VIII. CONCLUSIONS

We have presented a broad study of visual signatures in video visualization. We have successfully introduced flow visualization to assist in depicting motion features in visual signatures. We found that the flow-based visual signatures were essential to the recognition of certain types of motion, such as spinning, though they appeared to demand more display bandwidth and more effort from observers. In particular, in our field trial, combined volume and flow visualization was shown to be the most effective means for conveying the underlying motion actions in real-life videos.

We have conducted a user study, which provided us with an extensive set of useful data about human factors in video visualization. In particular, we have obtained the first set of evidence showing that human observers can learn to recognize types of motion from their visual signatures. Considering that most observers had little knowledge about visualization technology in general, it is encouraging that they achieved an 80% or above success rate for some types of motion within a 45-minute learning process. Some of the findings obtained in this user study have also indicated the possibility that perspective projection in a video may not necessarily be a major barrier, since human observers can recognize size changes with ease. We will conduct further user studies in this area.

We have designed and implemented a pipeline for supporting studies on video visualization. Through this work we have also obtained some first-hand evaluation of the effectiveness of different video processing techniques and visualization techniques.

While this paper has given a relatively broad treatment of the subject, it only represents the start of a new long-term research program.

REFERENCES

[1] P. Anandan. A computational framework and an algorithm for measurement of visual motion. International Journal of Computer Vision, 2:283–310, 1989.

[2] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow techniques. International Journal of Computer Vision, 12(1):43–77, 1994.

[3] P. Cavanagh, A. Labianca, and I. M. Thornton. Attention-based visual routines: Sprites. Cognition, 80:47–60, 2001.

[4] R. Chellappa. Special section on video surveillance (editorial preface). IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):745–746, 2000.

[5] R. Cutler, C. Shekhar, B. Burns, R. Chellappa, R. Bolles, and L. S. Davis. Monitoring human and vehicle activities using airborne video. In Proc. 28th Applied Imagery Pattern Recognition Workshop (AIPR), Washington, D.C., USA, 1999.

[6] G. W. Daniel and M. Chen. Video visualization. In Proc. IEEE Visualization, pages 409–416, Seattle, WA, October 2003.

[7] E. W. Dijkstra. A Discipline of Programming. Prentice-Hall, 1976.

[8] D. Dovey. Vector plots for irregular grids. In Proc. IEEE Visualization, pages 248–253, 1995.

[9] E. Durucan and T. Ebrahimi. Improved linear dependence and vector model for illumination invariant change detection. In Proc. SPIE, volume 4303, San Jose, California, USA, 2001.

[10] R. B. Fisher. The PETS04 surveillance ground-truth data sets. In Proc. 6th IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, pages 1–5, 2004.

[11] S. Guthe, S. Gumhold, and W. Straßer. Interactive visualization of volumetric vector fields using texture based particles. In Proc. WSCG, pages 33–41, 2002.

[12] A. Hertzmann and K. Perlin. Painterly rendering for video and interaction. In Proc. 1st International Symposium on Non-Photorealistic Animation and Rendering, pages 7–12, June 2000.

[13] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17:185–201, 1981.

[14] V. Interrante and C. Grosch. Visualizing 3D flow. IEEE Computer Graphics and Applications, 18(4):49–53, 1998.

[15] R. V. Klassen and S. J. Harrington. Shadowed hedgehogs: A technique for visualizing 2D slices of 3D vector fields. In Proc. IEEE Visualization, pages 148–153, 1991.

[16] A. W. Klein, P. J. Sloan, R. A. Colburn, A. Finkelstein, and M. F. Cohen. Video cubism. Technical Report MSR-TR-2001-45, Microsoft Research, October 2001.

[17] R. S. Laramee, H. Hauser, H. Doleisch, B. Vrolijk, F. H. Post, and D. Weiskopf. The state of the art in flow visualization: Dense and texture-based techniques. Computer Graphics Forum, 23(2):143–161, 2004.

[18] A. Mack and I. Rock. Inattentional Blindness. MIT Press, Cambridge, MA, 1998.

[19] N. V. Patel and I. K. Sethi. Video shot detection and characterization for video databases. Pattern Recognition, Special Issue on Multimedia, 30(4):583–592, 1997.

[20] M. I. Posner, C. R. R. Snyder, and B. J. Davidson. Attention and the detection of signals. Journal of Experimental Psychology: General, 109:160–174, 1980.

[21] F. H. Post, B. Vrolijk, H. Hauser, R. S. Laramee, and H. Doleisch. The state of the art in flow visualization: Feature extraction and tracking. Computer Graphics Forum, 22(4):775–792, 2003.

[22] Z. W. Pylyshyn and R. W. Storm. Tracking multiple independent targets: Evidence for a parallel tracking mechanism. Spatial Vision, 3:179–197, 1988.

[23] J. E. Raymond, K. L. Shapiro, et al. Temporary suppression of visual processing in an RSVP task: An attentional blink. Journal of Experimental Psychology: HPP, 18(3):849–860, 1992.

[24] D. J. Simons and R. A. Rensink. Change blindness: past, present, and future. Trends in Cognitive Sciences, 9(1):16–20, 2005.

[25] C. G. M. Snoek and M. Worring. Multimodal video indexing: a review of the state-of-the-art. Multimedia Tools and Applications, 2003.

[26] R. Strzodka and C. S. Garbe. Real-time motion estimation and visualization on graphics cards. In Proc. IEEE Visualization, pages 545–552, 2004.

[27] J. J. Thomas and K. A. Cook, editors. Illuminating the Path: The Research and Development Agenda for Visual Analytics. IEEE Press, 2005.

[28] A. W. Treisman and G. Gelade. A feature integration theory of attention. Cognitive Psychology, 12:97–136, 1980.

[29] A. Verri and T. Poggio. Motion field and optical flow: Qualitative properties. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(5):490–498, 1989.

[30] J. E. Vollrath, D. Weiskopf, and T. Ertl. A generic software framework for the GPU volume rendering pipeline. In Proc. Vision, Modeling and Visualization, pages 391–398, 2005.

[31] D. Weiskopf and G. Erlebacher. Overview of flow visualization. In C. D. Hansen and C. R. Johnson, editors, The Visualization Handbook, pages 261–278. Elsevier, Amsterdam, 2005.