
Sonderforschungsbereich 314
Künstliche Intelligenz - Wissensbasierte Systeme

KI-Labor am Lehrstuhl für Informatik IV

Leitung: Prof. Dr. W. Wahlster

VITRA
Universität des Saarlandes
FB 14 Informatik IV
Postfach 151150
D-66041 Saarbrücken
Fed. Rep. of Germany
Tel. 0681 / 302-2363

Bericht Nr. 103

Multimedia Presentation of Interpreted Visual Data

Elisabeth André, Gerd Herzog, Thomas Rist

Juni 1994

ISSN 0944-7814 103

Multimedia Presentation of Interpreted Visual Data*

Elisabeth André, Gerd Herzog**, Thomas Rist

German Research Center for Artificial Intelligence (DFKI)
D-66123 Saarbrücken, Germany
{andre, [email protected]

** SFB 314, Project VITRA, Universität des Saarlandes
D-66041 Saarbrücken, Germany
[email protected]

June 1994

Abstract

While computer vision aims at the transformation of image data into meaningful information, research in intelligent multimedia generation addresses the effective communication of information using multiple media such as text, graphics and video. We argue that combining the two research areas leads to an interesting new kind of information system. Such integrated systems will be able to flexibly transform visual data into various presentation forms including, for example, TV-style reports and illustrated articles. The paper elaborates on this transformation and provides a modularization into maintainable subtasks. How these subtasks can be accomplished will be sketched by means of Vips, a prototype system that has emerged from our previous work on scene analysis and multimedia generation. Vips analyses short sections of camera-recorded image sequences of soccer games and generates multimedia presentations of the interpreted visual data.

* To appear in: Proc. of AAAI-94 Workshop on "Integration of Natural Language and Vision Processing", Seattle, WA, 1994.


1 Introduction

Image understanding systems which perform a qualitative interpretation of a continuous flow of visual data allow the observation of inaccessible areas and will release humans from time-consuming and often boring observation tasks, e.g., in traffic control. Moreover, a sophisticated system may not only collect and condense data but also interpret them in a particular context and provide information that goes far beyond the set of visual input data (cf. [Herzog et al. 89; Koller et al. 92; Neumann 89; Tsotsos 85; Wahlster et al. 83; Walter et al. 88]).

Intelligent multimedia generation systems which employ several media such as text, graphics, and animation for the presentation of information (cf. [Arens et al. 93; Feiner & McKeown 93; Maybury 93; Roth et al. 91; Stock 91; Wahlster et al. 93]) attract increasing attention in many application areas since they (1) are able to flexibly tailor presentations to a user's individual needs and style preferences, (2) may use one medium in place of another, and (3) may combine media so that the strength of one medium overcomes the weakness of another.

[Figure 1: Examples of presentation styles. The figure combines the degree of detail (complete descriptions; only outstanding or unexpected observations; summary), the reporting mode (simultaneous; retrospective), and the media used (authentic video, speech, diagrams, written text), yielding presentation styles such as TV-style reports, radio-style reports, headlines, and illustrated newspaper reports.]

Combining techniques for image understanding and intelligent multimedia generation will open the door to an interesting new type of computer-based information system that provides highly flexible access to the visual world.

To see the benefits of such systems, we may look at information presentation in mass media like newspapers and television. There, multiple media have been used for years when reporting events, e.g., in sports coverage. The spectrum of commonly used presentation forms covers printed, often illustrated text, verbally commented pictures, authentic video clips, commented video, etc. However, the effort needed for manually preparing such presentations impedes the broad production of presentations for individual users. In contrast, an advanced computer-based reporting system could provide a low-cost way to present the same information in various forms, depending on generation parameters such as a user's actual interests and style preferences, time restrictions, etc.

Fig. 1 gives an impression of the variety of presentation styles that result from combining only three basic criteria: the information requirements, the reporting mode (the delay between data perception and information presentation), and the media used in the presentation.

The work described in this paper aims at a multimedia reporting system. Following the paradigm of rapid prototyping, we rely on our previous work in both the analysis and interpretation of image sequences (cf. [Herzog et al. 89; Herzog & Wazinski 94]) and the generation of multimedia presentations (cf. [Wahlster et al. 93; André & Rist 90]). Short sections of video recordings of soccer games have been chosen as the domain of discourse since they offer interesting possibilities for the automatic interpretation of visual data in a restricted domain. Also, the broad variety of commonly used presentation forms in sports reporting provides fruitful inspiration when investigating methods for the automated generation of multimedia reports.

2 From Visual Data to Multimedia Presentations

Our efforts aim at a system that essentially transforms acquired visual data into meaningful information, which in turn is transformed into a structured multimedia presentation. Fig. 2 provides a classification of representation formats as they may be used to bridge between the different steps of the transformation. In the following, we describe a decomposition of the transformation process into maintainable subtasks:

Processing image sequences

The processes on the sensory level start from digitized video frames and serve the automated construction of symbolic, computer-internal descriptions of perceived scenes. The analysis of time-varying image sequences is of particular importance. In this case, the processing concentrates on the recognition and tracking of moving objects. In the narrow sense, the intended output of a vision system would be an explicit, meaningful description of visible objects. Throughout this paper, we will use the term geometrical scene description (GSD), introduced in [Neumann 89], for this kind of representation.

[Figure 2: Levels of representation. Sensory level: digitized image sequence, object trajectories (e.g., TRAJ entries with coordinates per object), GSD. Conceptual level: relation tuples (e.g., (s-rel-in ball#1 penalty-area#2)), event propositions (e.g., (Proceed [3:39:05] (Event ball-transfer#23))), intentions and interactions (e.g., (Goal player#5 (attack player#8))). Presentation level: presentation goals (e.g., (Elaborate-Subevent S U ball-transfer#23 T)), mm-discourse structure, multimedia output (e.g., "Meier passes the ball to ...").]

Interpretation of visual information

High-level scene analysis aims at recognizing conceptual units at a higher level of abstraction and thus extends the scope of image understanding. The GSD serves as an intermediate representation between low-level image analysis and scene analysis and forms the basis for further interpretation processes which lead to representations constituting the conceptual level. These representations include spatial relations for the explicit characterization of spatial arrangements of objects, representations of recognized object movements, and also higher-level concepts such as representations of the behaviour and interaction patterns of the observed agents. Since one and the same scene may be interpreted differently by different observers, the interpretation process should be flexible enough to allow for situation-dependent interpretation.

Content selection and organization

Depending on the user's information needs, the system has to decide which propositions from the conceptual level should be communicated. Even if a detailed description is requested, it would be inappropriate to mention every single proposition provided by the scene analysis component. Following general rules of communication as formulated by Grice [Grice 75], the system has to ensure that all relevant information is encoded. On the other hand, the user should not be unnecessarily informed about facts he already knows. Furthermore, the system has to organize the selected contents in a coherent manner. Of course, a flexible system that supports varying styles of reporting cannot rely on a single strategy for content selection and organization. For example, in situations where all scene data are available before the generation process begins, content organization may use diverse sorting techniques to enhance coherence. These techniques usually fail in live reporting, where visual data are to be described while they are recorded and interpreted. In that case, however, the emphasis is usually more on topicality than on the coherence of the description.
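For illustration, a content selection step of this kind could be sketched as follows. This is only a minimal sketch, not the Vips representation: the proposition format, the salience scores, the threshold, and the user-model test are assumptions introduced for this example.

# Minimal sketch of content selection and organization (not the Vips code).
# Assumptions: propositions carry a salience score, a time stamp, and we know
# a set of facts the user is assumed to know already.

from dataclasses import dataclass
from typing import List, Set

@dataclass
class Proposition:
    content: str          # e.g. "(Ball-Transfer :Agent player#6 ...)"
    salience: float       # higher = more worth reporting
    begin: float          # time stamp of the underlying event

def select_content(props: List[Proposition],
                   known_to_user: Set[str],
                   threshold: float = 0.5) -> List[Proposition]:
    """Keep relevant, novel propositions (Gricean maxims of quantity and relation)."""
    relevant = [p for p in props
                if p.salience >= threshold and p.content not in known_to_user]
    # Retrospective mode: order the selected contents temporally for coherence.
    return sorted(relevant, key=lambda p: p.begin)

if __name__ == "__main__":
    props = [Proposition("(Kick :Agent player#6)", 0.4, 3.0),
             Proposition("(Ball-Transfer :Agent player#6)", 0.9, 3.5),
             Proposition("(Team-Attack :Team A)", 0.8, 1.0)]
    for p in select_content(props, known_to_user=set()):
        print(p.content)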

Coordinated distribution of information on several media

An optimal exploitation of different media requires a presentation system to decide carefully when to use one medium in place of another and how to integrate different media in a consistent and coherent manner. This also includes determining an appropriate degree of complementarity and redundancy of the information presented in different media. A presentation that contains no redundant information at all tends to be incoherent. If, however, too much information is paraphrased in different media, the user may concentrate on one medium after a short time and will probably overlook information.

Medium-specific encoding of information

A multimedia presentation system must manage the presentation of text, graphics, video, and whatever other media it employs. In the simplest case, presentation means the automatic retrieval of already available output units, e.g., canned text or recorded video clips. More ambitious approaches address generation from scratch. Such approaches incorporate design expertise and provide mechanisms for the automatic selection, creation, and combination of medium-specific primitives (e.g., words, icons, video frames, etc.). To ensure the coherence of presentations, output fragments in different media have to be tailored to each other. Therefore, no matter how medium-specific output is produced, it is important that the system maintains an explicit representation of the encodings used.

Output coordination

The last step of the transformation concerns the arrangement of the presentation fragments provided by the generators into a multimedia output. A purely geometrical treatment of this layout task would, however, lead to unsatisfactory results. Rather, layout has to be considered as an important carrier of meaning. For example, two pictures that serve to contrast objects should be placed side by side. When using dynamic media, such as animation and speech, layout design also requires the temporal coordination of output units.

The identification of subtasks as described above gives an idea of the processes that a reporting system has to maintain. The architectural organization of these processes is, however, a crucial issue, especially when striving for a system that supports various presentation styles. For example, the automatic generation of live presentations calls (1) for an incremental strategy for the recognition of object movements and assumed intentions, and (2) for an adequate coordination of recognition and presentation processes. Also, there are various dependencies between choices in the presentation part. To cope with such dependencies, it seems unavoidable to interleave the processes for content determination, mode selection and content realization.

3 VIPS: A Visual Information Presentation System

The most straightforward approach to building a reporting system is to rely on existing modules for the interpretation of image sequences and the generation of multimedia presentations. In conceiving our prototype system, called Vips, we consequently follow this approach wherever the reuse of modules from our previous systems Vitra [Herzog & Wazinski 94] and Wip [André & Rist 94] is possible. In the following, we sketch the processing mechanisms of Vips' core modules.

3.1 Image Analysis

For technical reasons, we do not directly incorporate a low-level vision component for the processing of the camera data into Vips. Rather, this task is done with the systems Actions [Sung 88] and Xtrack [Koller et al. 92], which have been developed by our partners at the Fraunhofer Institute for Information and Data Processing (IITB) in Karlsruhe. Actions recognizes moving objects within real-world image sequences. It performs a segmentation and cueing of moving objects by computing and analyzing displacement vector fields. The more recent Xtrack system accomplishes a model-based recognition and classification of rigid objects.

Sequences of up to 1000 images, i.e., 40 seconds of playback time, recorded with a stationary TV camera during a game in the German professional soccer league, have been evaluated by the Actions system (cf. [Herzog et al. 89]). In this domain, segmentation becomes quite difficult because the moving objects cannot be regarded as rigid and occlusions occur very frequently.

The as yet partial trajectories delivered by Actions are currently used to interactively synthesize a realistic GSD, with object candidates assigned to previously known players and the ball. The approach described in [Rohr 94] for the geometric modeling of an articulated body has been adopted in Vips in order to represent the players in the soccer domain (cf. [Herzog 92b]). The stationary part of the GSD, an instantiated model of the static background, is fed into the system manually.

3.2 Scene Interpretation

Many previous attempts at high-level scene analysis (e.g., [Neumann 89; Wahlster et al. 83; Walter et al. 88]) are based on an a posteriori interpretation strategy, which requires a complete GSD covering the entire image sequence to be available when the analysis process starts. Hence, these systems generate retrospective scene descriptions only.

Greater flexibility can be achieved if an incremental strategy is employed (cf. [Herzog et al. 89; Koller et al. 92; Tsotsos 85]), with the GSD constructed step by step and processed simultaneously as the scene progresses. Immediate system reactions, as needed for live presentations and within autonomous systems, are possible because information about the current scene is provided as well. In Vips, high-level scene analysis includes:

Computation of spatial relations

In the GSD, spatial information is encoded only implicitly. In analogy to prepositions, their linguistic counterparts, spatial relations provide a qualitative description of the spatial arrangements of objects. Each spatial relation characterizes a class of object configurations by specifying conditions, such as the relative position of objects or the distance between them.

Instead of assigning simple truth values to spatial predications, a measure of the degree of applicability has been introduced that expresses the extent to which a spatial relation is applicable (cf. [André et al. 87]). On the one hand, more exact scene descriptions are possible since the degree of applicability can be expressed linguistically (e.g., 'directly behind' or 'more or less in front of'). On the other hand, the degree of applicability can be used to select the most appropriate reference object(s) and relation if an object configuration can be described by several spatial predications.

Our system is capable of computing topological (e.g., in, near, etc.) as well as orientation-dependent relations (e.g., left-of, over, etc.). Since the frame of reference is explicitly taken into account, the system can cope with the intrinsic, extrinsic, and deictic use of directional prepositions (cf. [André et al. 87; Gapp 94]).
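To convey how a graded applicability value might be computed for an orientation-dependent relation, consider the following sketch. It is not the Vitra/Vips computation; the 2D geometry, the "right of" example, and the linear fall-off over the angular deviation are assumptions made for illustration only.

# Sketch: graded applicability of a directional relation such as "right of".
# Not the original Vitra/Vips computation; the linear fall-off over the
# angular deviation is an assumption made for this illustration.

import math

def applicability_right_of(ref, obj, frame_direction=(0.0, 1.0)) -> float:
    """Return a value in [0, 1]: 1 = ideally right of the reference object
    (with respect to the given frame of reference), 0 = not applicable."""
    dx, dy = obj[0] - ref[0], obj[1] - ref[1]
    # "Right" is the frame direction rotated by -90 degrees.
    rx, ry = frame_direction[1], -frame_direction[0]
    angle = math.atan2(dy, dx) - math.atan2(ry, rx)
    deviation = abs(math.atan2(math.sin(angle), math.cos(angle)))  # wrap to [0, pi]
    return max(0.0, 1.0 - deviation / (math.pi / 2))  # 0 once deviation >= 90 deg

print(applicability_right_of(ref=(0, 0), obj=(3, 0)))  # 1.0: directly right of
print(applicability_right_of(ref=(0, 0), obj=(3, 1)))  # < 1: "more or less" right of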

Characterization and interpretation of object movements

When analyzing time-varying image sequences, spatio-temporal concepts can also be extracted from the GSD. These conceptual units, which we will call motion events, serve for the symbolic abstraction of the temporal aspects of the scene. With respect to the natural language description of image sequences, events are meant to represent the meaning of motion and action verbs.

The recognition of movements is based on event models, i.e., declarative descriptions of classes of higher conceptual units capturing the spatio-temporal aspects of object motions. The event concepts are organized into an abstraction hierarchy, grounded on specialization (e.g., running is a special case of moving) and temporal decomposition (cf. Fig. 3). This conceptual hierarchy can also be utilized to guide the selection of the relevant propositions when producing a presentation. Besides the question of which events are to be extracted from the GSD, it is decisive how the recognition process is realized. With respect to the generation of simultaneous multimedia presentations, the following problem becomes obvious: if the presentation is to be focused on what is currently happening, it is very often necessary to describe object motions even while they occur. Thus, motion events have to be recognized stepwise as they progress, and event instances must be made available for further processing from the moment they are first noticed.

Since the distinction between events that have and those that have not occurred is insufficient, we have introduced the additional predicates start, proceed, and stop, which can be used to characterize the progression of an event (cf. [André et al. 88]).

Header:
  (BALL-TRANSFER ?p1*player ?b*ball ?p2*player)
Conditions:
  (eql (TEAM ?p1) (TEAM ?p2))
Subconcepts:
  (BALL-POSSESSION ?p1 ?b)   [I1]
  (MOVE-FREE ?b)             [I2]
  (BALL-POSSESSION ?p2 ?b)   [I3]
Temporal-Relations:
  [I1] :meets [BALL-TRANSFER]
  [I1] :meets [I2]
  [I2] :equal [BALL-TRANSFER]
  [I2] :meets [I3]

Figure 3: Event model

Labeled directed graphs with edges of a certain type, so-called course diagrams, are used to model the prototypical progression of an event. Fig. 4 shows a simplified course diagram for the concept BALL-TRANSFER. It describes a situation in which a player passes the ball to a teammate. The event starts if a BALL-POSSESSION event stops and the ball is free. The event proceeds as long as the ball is moving free and stops when the recipient has gained possession of the ball.

The recognition of an occurrence can be thought of as traversing the course diagram, where the edge types are used for the definition of the basic event predicates. Course diagrams rely on a discrete model of time, which is induced by the underlying sequence of digitized TV frames. They allow incremental event recognition, since exactly one edge per unit of time is traversed. Using constraint-based temporal reasoning, course diagrams are constructed automatically from interval-based concept definitions (cf. [Herzog 92a]).

[Figure 4: Course diagram for BALL-TRANSFER with states S0, S1, and S2. The :START edge (from S0 to S1) is labeled with the condition (AND (STOP (BALL-POSS ?p1 ?b) ?t) (START (MOVE-FREE ?b) ?t)), the :PROCEED edge (looping on S1) with (PROCEED (MOVE-FREE ?b) ?t), and the :STOP edge (from S1 to S2) with (AND (STOP (MOVE-FREE ?b) ?t) (START (BALL-POSS ?p2 ?b) ?t)).]
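To make the traversal idea concrete, the following sketch implements a course diagram as a small state machine over discrete time steps. It is a simplified illustration rather than the Vips recognizer: the per-frame conditions for the :START, :PROCEED, and :STOP edges are supplied as booleans instead of being derived from the GSD.

# Sketch of incremental event recognition by traversing a course diagram.
# Simplified illustration (not the Vips component): exactly one edge per unit
# of time is traversed, and the edge conditions arrive as per-frame booleans.

class BallTransferRecognizer:
    def __init__(self):
        self.state = "S0"

    def step(self, t, start_cond, proceed_cond, stop_cond):
        """Advance one edge for time step t; return the event predicate that
        holds at t (START/PROCEED/STOP of BALL-TRANSFER), or None."""
        if self.state == "S0" and start_cond:
            self.state = "S1"
            return ("START", "BALL-TRANSFER", t)
        if self.state == "S1" and stop_cond:
            self.state = "S2"
            return ("STOP", "BALL-TRANSFER", t)
        if self.state == "S1" and proceed_cond:
            return ("PROCEED", "BALL-TRANSFER", t)
        return None

# Toy frame-by-frame input: ball leaves player 1, moves freely, reaches player 2.
rec = BallTransferRecognizer()
frames = [(0, False, False, False),
          (1, True,  False, False),   # BALL-POSSESSION(p1) stops, ball moves free
          (2, False, True,  False),   # ball still moving free
          (3, False, False, True)]    # BALL-POSSESSION(p2) starts
for t, s, p, e in frames:
    result = rec.step(t, s, p, e)
    if result:
        print(result)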

Recognition of presumed goals and plans of the observed agents

For human observers, the interpretation of visual information also involves inferring the intentions, i.e., the plans and goals, of the observed agents (e.g., player A does not simply approach player B, but tackles him).

In the soccer domain, the influence of the agents' assumed intentions on the results of the scene analysis is particularly obvious. Given the positions of the players, their team membership, and the distribution of roles in standard situations, stereotypical intentions can be inferred for each situation. We use the system component described in [Retz-Schmidt 91], which is able to incrementally recognize the intentions of, and interactions between, the agents as well as the causes of possible plan failures.

Partially instantiated plan hypotheses taken from a plan library are successively instantiated according to the incrementally recognized events. Each element of the plan library contains information about the necessary preconditions of the (abstract) action it represents as well as information about its intended effect. A hierarchical organization is achieved through decomposition and specialization relations. Observable events and spatial relations constitute the leaves of the plan hierarchy.

Knowledge about the cooperative (e.g., double-pass) and antagonistic behaviour (e.g., offside-trap) of the players is represented in the interaction library. A successful plan triggers the activation of a corresponding interaction schema.
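The incremental instantiation of plan hypotheses can be hinted at with a small sketch. It is an illustration only; the hypothesis format and the "double-pass" decomposition shown here are assumptions and do not reproduce the component described in [Retz-Schmidt 91].

# Sketch of incremental plan recognition: partially instantiated plan hypotheses
# are checked off as recognized events arrive. Illustration only; not the
# component of [Retz-Schmidt 91].

class PlanHypothesis:
    def __init__(self, name, expected_steps):
        self.name = name
        self.remaining = list(expected_steps)   # ordered (abstract) sub-actions

    def observe(self, event_type):
        """Consume the next expected step if it matches the observed event;
        return True once the hypothesis is fully instantiated."""
        if self.remaining and self.remaining[0] == event_type:
            self.remaining.pop(0)
        return not self.remaining

hypothesis = PlanHypothesis("double-pass",
                            ["BALL-TRANSFER", "RUN-FORWARD", "BALL-TRANSFER"])
for ev in ["BALL-TRANSFER", "RUN-FORWARD", "BALL-TRANSFER"]:
    if hypothesis.observe(ev):
        print("interaction schema activated:", hypothesis.name)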

3.3 Presentation Planning

Following a speech-act theoretic perspective, the generation of multimedia documents is considered a goal-directed activity (cf. [André & Rist 90]). Starting from a communicative goal (e.g., describe the scene), a presentation planner builds up a refinement-style plan in the form of a directed acyclic graph (DAG). This plan reflects the propositional contents of the potential document parts, the intentional goals behind the parts, as well as the rhetorical relationships between them (cf. [André & Rist 93]). While the top of the presentation plan is a more or less complex presentation goal, the lowest level is formed by specifications of elementary presentation tasks (e.g., formulating a request or depicting an object) that are directly forwarded to the medium-specific design components.

To represent presentation knowledge, we have defined strategies that refer to both text and picture production. While some strategies reflect general presentation knowledge, others are more domain-dependent and specify how to present a certain subject. To utilize the plan-based approach in Vips, we define new strategies for scene description. For example, the strategy shown in Fig. 5 may be used to verbally describe a sequence of events by informing the user about the main events (e.g., team-attack), illustrating them with a snapshot, and providing more details about the subevents (e.g., kick).

Header: (Describe-Scene S U ?events T)
Effect: (FOREACH ?one-ev
          WITH (AND (BEL S (Main-Ev ?one-ev))
                    (BEL S (In ?one-ev ?events)))
          (BMB S U (In ?one-ev ?events)))
Applicability-Conditions:
  (BEL S (Temporally-Ordered-Sequence ?events))
Main Acts:
  ((FOREACH ?one-ev
     WITH (AND (BEL S (Main-Ev ?one-ev))
               (BEL S (In ?one-ev ?events)))
     (Inform S U ?one-ev T)))
Subsidiary Acts:
  ((Illustrate S U ?ev G)
   (Elaborate-Subevents S U ?sub-ev ?medium))

Figure 5: Plan operator for describing a scene
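The way such plan operators drive a refinement-style expansion can be hinted at with the following sketch. It is a strong simplification, not the Wip/Vips planner: the operator table, the distinction between elementary and composite acts, and the names used here are invented solely for this illustration.

# Sketch of refinement-style presentation planning: a communicative goal is
# expanded via strategies into main and subsidiary acts until only elementary
# presentation tasks remain. Highly simplified; not the actual Wip/Vips planner.

STRATEGIES = {
    # goal name -> (main acts, subsidiary acts)
    "Describe-Scene":      (["Inform-Main-Events"], ["Illustrate", "Elaborate-Subevents"]),
    "Elaborate-Subevents": (["Inform-Salient-Subevents"], ["Elaborate-Agents"]),
}

ELEMENTARY = {"Inform-Main-Events", "Illustrate",
              "Inform-Salient-Subevents", "Elaborate-Agents"}

def expand(goal, depth=0):
    """Recursively refine a presentation goal and print the resulting plan tree."""
    print("  " * depth + goal)
    if goal in ELEMENTARY:
        return                                  # forwarded to a design component
    main, subsidiary = STRATEGIES[goal]
    for act in main + subsidiary:
        expand(act, depth + 1)

expand("Describe-Scene")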

To accomplish the last communicative act, the strategy shown in Fig. 6 may be applied. It informs the user about all salient subevents and provides more details about the agents involved. To determine the salience of an event, factors such as its frequency of occurrence, the complexity of its generic event model, the salience of the objects involved, and the area in which it takes place are taken into account (see also [André et al. 88]). All events are described in their temporal order. Further grouping principles for events are discussed in [Maybury 91].

The strategies defined in Fig. 5 and Fig. 6 can be used to generate a posteriori scene descriptions. They presuppose that the input data from which relevant information has to be selected are given a priori. Since both strategies iterate over complete lists of temporally ordered events, the presentation process cannot start before the interpretation of the whole scene is completed.

Header: (Elaborate-Subevent S U ?ev T)
Effect: (FOREACH ?sub-ev
          WITH (AND (BEL S (Salient ?sub-ev))
                    (BEL S (Sub-Ev ?sub-ev ?ev)))
          (BMB S U (Sub-Ev ?sub-ev ?ev)))
Applicability-Conditions:
  (AND (BEL S (Sub-Events ?ev ?sub-events))
       (BEL S (Temporally-Ordered-Sequence ?sub-events)))
Main Acts:
  ((FOREACH ?sub-ev
     WITH (AND (BEL S (In ?sub-ev ?sub-events))
               (BEL S (Salient ?sub-ev)))
     (Inform S U ?sub-ev T)))
Subsidiary Acts:
  ((Elaborate-Agents S U ?sub-ev ?medium))

Figure 6: Plan operator for describing subevents

However, Vips is also able to generate live reports. The main characteristic of this kind of presentation is that input data are continuously delivered by the scene interpretation system and the presentation planner has to react immediately to incoming data. In such a situation, no global organization of the presentation is possible. Instead of collecting scene data and organizing them (e.g., according to their temporal order as in the first two strategies), the system has to decide locally which event should be reported next, considering the current situation. Such behavior is reflected by the strategy shown in Fig. 7. In contrast to the strategy shown in Fig. 6, events are selected for their topicality. Topicality is determined by the salience of an event and the time that has passed since its occurrence. Consequently, the topicality of events decreases as the scene progresses. If an outstanding event (e.g., a goal kick) occurs which has to be verbalized as soon as possible, the presentation planner may even give up partially planned presentation parts in order to communicate the new event immediately.

Header: (Describe-Next S U ?ev T)
Effect: (AND (BMB S U (Next ?preceding-ev ?ev))
             (BMB S U (Last-Reported ?ev)))
Applicability-Conditions:
  (AND (BEL S (Last-Reported ?preceding-ev))
       (BEL S (Topical ?ev *Time-Available*))
       (BEL S (Next ?preceding-ev ?ev)))
Main Acts: ((Inform S U ?ev T))
Subsidiary Acts: ((Describe-Next S U ?next-ev T))

Figure 7: Plan operator for simultaneous description
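One possible reading of this selection criterion is sketched below: topicality is modeled as salience discounted by the time elapsed since the occurrence, and the most topical unreported event is chosen. The decay function and the cut-off value are assumptions for illustration, not the definition used in Vips.

# Sketch of topicality-based event selection for live reporting. The linear
# decay of salience over elapsed time and the cut-off are illustrative
# assumptions, not the actual Vips definition.

def topicality(salience, occurred_at, now, decay=0.1):
    """Salience discounted by the time that has passed since the occurrence."""
    return salience - decay * (now - occurred_at)

def select_next(events, reported, now, cutoff=0.0):
    """Pick the most topical event that has not been reported yet."""
    candidates = [(topicality(s, t, now), name)
                  for name, s, t in events if name not in reported]
    candidates = [c for c in candidates if c[0] > cutoff]
    return max(candidates)[1] if candidates else None

events = [("ball-transfer#23", 0.7, 415.0),   # (name, salience, time of occurrence)
          ("attack#7",         0.9, 417.5),
          ("goal-kick#2",      1.0, 418.0)]
print(select_next(events, reported={"ball-transfer#23"}, now=418.0))  # -> goal-kick#2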

The realization of the main act in Fig. 7 depends on whether the user has visual access to the scene or not. For example, an utterance such as "pay attention to the player in the penalty area" does not make much sense if the user cannot see the scene.


3.4 Generating textual presentation parts

Like the event recognition component, the text generator described in [Harbusch et al. 91] follows an incremental processing scheme. It can begin outputting words before the input is complete. Such generators are more flexible because they can also be used in situations where it is not possible to delay the output until the input is complete (cf. [Finkler & Schauder 92]). However, it is no longer guaranteed that new input can always be integrated into a previously uttered part of a sentence. In such cases, revisions become necessary.

The first component that is activated during natural language generation is the text design component. As soon as the presentation planner decides that a particular element should be presented as part of a text, the element is handed over as input to this component. The main task of the text design component is the organization of input elements into clauses. This comprises determining the order in which the given input elements can be realized in the text, as well as lexical choice. The results of the text designer are preverbal messages.

These preverbal messages are forwarded in a piecemeal fashion to the text realization component, where grammatical encoding, linearization and inflection take place. The text realization component is based on the formalism of Lexicalized LD/LP Tree Adjoining Grammars. It associates lexical items with syntactic rules, permits flexible expansion operations and allows the description of local dominance to be separated from linear precedence rules. These characteristics make it a good candidate for incremental generation.
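The overall interface behavior of this two-stage pipeline can be conveyed with a toy sketch: a preverbal message is realized as soon as agent and object are known, and the clause is completed once the recipient has been recognized. The message format and the template-based realization are stand-ins introduced here for illustration; they have nothing to do with the actual TAG-based component.

# Toy sketch of the two-stage pipeline (text design -> realization) with a late,
# incremental completion. The message format and template realization are
# illustrative stand-ins, not the TAG-based component of [Harbusch et al. 91].

def text_design(proposition):
    """Organize a content element into a clause-sized preverbal message
    (ordering of elements and lexical choice)."""
    return {"verb": "passes", "agent": proposition["agent"],
            "object": "the ball", "recipient": proposition.get("recipient")}

def realize(msg):
    """Grammatical encoding and linearization (here: a simple template)."""
    clause = f"{msg['agent']} {msg['verb']} {msg['object']}"
    if msg["recipient"]:                       # may be added later, incrementally
        clause += f" to {msg['recipient']}"
    return clause + "."

prop = {"agent": "Bommer, the midfield player,", "recipient": None}
msg = text_design(prop)
print(realize(msg))                            # output before the recipient is known
msg["recipient"] = "Bosch, the outside left"
print(realize(msg))                            # completed once it has been recognized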

3.5 Generating visual presentation parts

In a system like Vips, it is quite natural to base the generation of visual presentations on the camera-recorded visual data and on information obtained from the various levels of image interpretation. For example, when generating live reports, one may include the original camera data directly in the presentation. In this case, the graphics generator is only requested to forward the camera data to a video window. To deal with more interesting tasks, the system must have appropriate generation techniques at its disposal. For the current version of Vips, we have developed techniques for:

Content-based search for subsequences

A recorded image sequence can be split into subsequences of arbitrary length, from a single image up to all images. Content-based search serves to find such subsequences according to semantic criteria. For example, one may be interested in the occurrence of a particular event, or in the trajectory of a certain agent or object. In contrast to video transcription and presentation systems such as Ivaps [Csinger & Booth 94], Vips' graphics generator benefits from its connection to the image understanding component. Search specifications are formulated on the level of event propositions. By tracing back the event recognition process, the original image data are localized and the corresponding subsequences are returned.
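The mapping from event propositions back to image data can be sketched as follows. The index structure shown here is an assumption introduced for the example; the actual localization in Vips traces back through the event recognition process itself.

# Sketch of content-based search for subsequences: recognized events are traced
# back to the frame interval from which they were derived. The index structure
# is an illustrative assumption, not the Vips implementation.

from typing import Dict, List, Tuple

# Assumed index filled during event recognition: event id -> (first, last frame).
EVENT_INDEX: Dict[str, Tuple[int, int]] = {
    "ball-transfer#23": (415, 440),
    "attack#7":         (470, 512),
}

def frames_for(event_id: str, frames: List[bytes]) -> List[bytes]:
    """Return the subsequence of frames in which the given event occurred."""
    first, last = EVENT_INDEX[event_id]
    return frames[first:last + 1]

# Usage with a dummy sequence of 1000 'frames':
frames = [b""] * 1000
clip = frames_for("ball-transfer#23", frames)
print(len(clip), "frames selected")   # -> 26 frames selected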

Display style modifications

When displaying an image or an image sequence, material and temporal aspects may be modified in order to accomplish certain communicative goals, or to meet situation-dependent constraints, e.g., resource limitations. Concerning the visual appearance of objects, Vips supports photorealistic display styles (cf. [Herzog 92b]). Such presentations can be realized as filtered displays of the original camera frames. Starting from the propositional GSD, Vips is also able to produce schematic pictures and animations, in either 2D or 3D. In the schematic mode, static background objects are approximated by line drawings/3D models, and moving objects are represented by predefined icons/3D bodies. The generation of 3D animations is an interesting feature, since it allows the "re-recording" of a scene from arbitrary viewpoints, e.g., from the viewpoints of the agents and objects involved. Concerning the temporal aspect of an image sequence, three display modes can be chosen: true time (25 frames per second), slow motion, and quick motion.
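A minimal sketch of the temporal display modes is given below: it maps recorded frame indices to presentation times. The concrete slow/quick factors and the frame-dropping rule are assumptions made for this illustration, not the Vips settings.

# Sketch of the temporal display modes: which recorded frames are shown at which
# presentation time. The slow/quick factors are illustrative assumptions.

def display_times(num_frames, mode="true-time", fps=25.0):
    """Return (frame index, presentation time in seconds) pairs for the mode.
    True time plays back at 25 frames per second."""
    speed = {"true-time": 1.0, "slow-motion": 0.5, "quick-motion": 2.0}[mode]
    frame_step = 2 if mode == "quick-motion" else 1   # drop frames when speeding up
    return [(i, (i / fps) / speed) for i in range(0, num_frames, frame_step)]

for index, t in display_times(100, mode="slow-motion")[:3]:
    print(index, round(t, 3))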

[Figure 8: Live presentation. A video window shows the scene at time [6:57:80], accompanied by the incrementally generated commentary: "... Bommer, the midfield player, passes the ball to Bosch, the outside left. Bosch is attacked by Maller, the outside right."]

Data aggregation

Showing an original image sequence is often less effective than a presentation with less visual data. This becomes obvious when dynamic concepts have to be visualized by static graphics which are to be included in a print document. The mere listing of frames is inappropriate because there is a high risk that an observer would not see the "trees for the forest". Reporting in mass media gives valuable inspiration for enhancing the effectiveness of visual presentations by means of aggregation techniques. In Vips, we aim at operationalizations of such techniques for the production of dynamic and static visual presentations. For example, recorded video sequences can be shortened by cutting out less interesting frames. To find frames which can be omitted without destroying the presentation, we take into account the recognized event structure of the sequence and use criteria such as the spatial coherency of objects in subsequent frames. In the case of static graphics, we also start from the event structure to find the most significant key frames of a sequence. For some purposes a single key frame will suffice, e.g., when the result or outcome of an event has to be shown. In other situations, one may apply techniques used in technical illustrations to aggregate the information of several images into a single one. For example, to visualize an object trajectory, we start from a key frame that shows the object either in its start or end position and then superimpose an arrow on the image to trace the object's locations in succeeding or preceding frames.
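For illustration, a frame-dropping step of this kind could look as follows: frames are kept when they coincide with an event boundary or when the tracked objects have moved noticeably since the last kept frame. The coherence threshold and the data layout are assumptions introduced for this sketch, not the Vips criteria.

# Sketch of data aggregation for video shortening: frames are dropped when no
# recognized event boundary falls on them and the tracked objects have barely
# moved since the last kept frame. The threshold is an illustrative assumption.

def shorten(frames, positions, event_boundaries, min_shift=5.0):
    """frames: list of frame ids; positions[i]: dict object -> (x, y) in frame i;
    event_boundaries: frame ids where a recognized event starts or stops."""
    kept = [frames[0]]
    last = 0
    for i in range(1, len(frames)):
        shift = max(abs(positions[i][o][0] - positions[last][o][0]) +
                    abs(positions[i][o][1] - positions[last][o][1])
                    for o in positions[i])
        if frames[i] in event_boundaries or shift >= min_shift:
            kept.append(frames[i])
            last = i
    return kept

positions = [{"ball": (0, 0)}, {"ball": (1, 0)}, {"ball": (7, 0)}, {"ball": (8, 0)}]
print(shorten([0, 1, 2, 3], positions, event_boundaries={3}))   # -> [0, 2, 3]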

Visualization of inferred information

The interpretation of image sequences may lead to information which is not directly apparent in the raw image data. This does not mean, however, that inferred information cannot be presented visually. Marking objects or object groups by color, or annotating them with text labels, are simple techniques for including additional information in a graphical presentation. For other purposes, superimposition techniques are more suitable. For example, when analyzing a soccer game, it may be of interest whether a player had alternative moves in a crucial situation. Provided that the image interpretation system is able to recognize such alternatives, they can be visualized by superimposing hypothetical trajectories on the original scene data.

4 Generation Examples

To give an impression of how Vips works, we present two generation examples taken from the domain of soccer.

In the first example, a TV-style live report is to be generated, i.e., a soccer scene has to be described while the scene progresses. We simulate the progress of the scene by passing the GSD data incrementally to the recognition component. We choose text and video as presentation media, which, in this case, means displaying the original image sequence without further modifications. For this kind of presentation, video is considered the guiding medium, to which the textual comments have to be tailored. The example is illustrated by Fig. 8, which shows a part of the coordinated stream of visual and textual output. To describe the underlying generation process, we start at the image frame marked by the timestamp [6:57:80], which is displayed shortly after a preceding utterance has been completed.

To select the next event to be textually communicated, the presentation planner applies the strategy shown in Fig. 7. When testing the applicability conditions of this strategy, the variable ?preceding-ev is instantiated with the last proposition that has been verbalized. To instantiate the variable ?ev, the presentation planner searches for a topical event that comes after ?preceding-ev. In the example, ?ev is bound to (Ball-Transfer :Agent player#6 :Object ball :Recipient nil :Begin [6:57:50]). This event is selected since it is the only event in which the ball, one of the most salient objects, is involved.

After the refinement of (Inform S U ?ev T), the following acts have been posted as new subgoals: four referential acts for specifying the action and its associated case roles, and an elementary surface speech act, S-Inform, that is passed on to the text designer. Note that the presentation planner forwards a certain piece of information to the generator concerned as soon as it has decided which component should encode it. In our example, (S-Inform S U ...) is sent to the text generator although a content specification for the recipient is still missing. The text designer creates input for the TAG-based realization component, which starts processing this input and generates: "Bommer, the midfield player, passes the ball ...".

In the meantime, the event recognition component has identified the recipient of the ball (player#7). This new information allows the presentation planner to determine the following content specification: (the ?z (name ?z Bosch) (outside-left ?z)). Thus, the incomplete specification that has been sent to the text generator is supplemented accordingly. In this case, the text generator is able to complete the sentence just by adding the prepositional phrase "to Bosch, the outside left.". Of course, there are also situations in which revisions are necessary (cf. [Wahlster et al. 93]).

Meanwhile, the presentation planner has again applied the strategy shown in Fig. 7, and ?ev is bound to (Attack :Agent player#20 :Patient player#7). After completing the last sentence, the realization component generates "He is attacked by Maller, the outside right."

In the second example, we assume that a retrospective description of a past scene is to be generated in a format that can be printed on paper. In this case, the system has to accomplish the goal (Describe-Scene S U ?events T), whereby the variable ?events is bound to a list of temporally ordered events delivered by the recognition component. The presentation planner first determines the main events and forwards a content specification to the text generator. In addition, it requests the graphics generator to illustrate the course of the events. Since only static graphics can be printed on paper, it is not possible to include the original video sequence in the presentation. Therefore, the graphics designer starts with a snapshot showing the positions of the players at the beginning of the events and relies on data aggregation to encode the trajectories of the moving objects (cf. Fig. 9). During the generation of the illustration, the presentation planner has expanded (Elaborate-Subevents S U ?sub-ev ?medium) to determine which information about the subevents should be communicated to the user. In order to facilitate referent identification, the system has attached the numbers used in the icons to the expressions referring to the players.

[Figure 9: A posteriori report. A schematic illustration with superimposed trajectories is accompanied by the text: "In the 15th minute, team A started an attack. Bösel (8), the outside left, centered the ball to Britz (9) in front of the goal. The goalkeeper (1) intercepted the ball."]


5 Summary

In this contribution, we have reported on our efforts to bridge from computer vision to multimedia generation. We have outlined the system Vips, which takes camera-recorded image sequences as input, uses incremental strategies for the recognition of higher-level concepts such as spatial relations, motion events and intentions, and relies on a plan-based approach to communicate recognized occurrences with multiple presentation media.

Implementations of most of the core modules (scene interpretation, presentation planner, text generator) are already available and allow the automatic generation of textual descriptions for short image sequences. The knowledge base of the system currently consists of about 100 concept definitions for spatial relations, motion events, plans, and plan interaction schemata. As yet, the graphics generation component only provides the basic functions (display of video sequences/single frames, icon-based visualization of trajectory data). To generate presentation examples such as those presented in Section 4, the interfacing between some components still has to be done manually. Our current efforts aim at a fully integrated version of the Vips system with improved graphics capabilities.

Perhaps the most interesting topic for further research is the bidirectional interleaving of image interpretation and presentation planning. In some situations, it would be useful for the presentation planner to request particular information from the interpretation system, which may eventually force the vision system to actively establish conditions under which this information can be obtained, e.g., by changing the sensor parameters. Active vision is particularly required when information is missing that is needed to decide whether an applicability condition of a presentation strategy is satisfied or not. Furthermore, this feature could be used for the generation of visual presentation fragments: one simply drives the camera to obtain a certain picture or video clip.

Acknowledgements

The work described in this paper was partly supported by the Special Collaborative Program on AI and Knowledge-based Systems (SFB 314), project VITRA, of the German Science Foundation (DFG) and by the German Ministry for Research and Technology (BMFT) under grant ITW8901 8, project WIP. We would like to thank Wolfgang Wahlster who, as the leader of both projects, made this cooperation possible.


References

[André & Rist 90] E. André and T. Rist. Towards a Plan-Based Synthesis of Illustrated Documents. In: Proc. of the 9th ECAI, pp. 25-30, Stockholm, 1990.

[André & Rist 93] E. André and T. Rist. The Design of Illustrated Documents as a Planning Task. In: M. T. Maybury (ed.), Intelligent Multimedia Interfaces, pp. 94-116. Menlo Park, CA: AAAI Press, 1993.

[André & Rist 94] E. André and T. Rist. Generating Coherent Presentations Employing Textual and Visual Material. Artificial Intelligence Review Journal, 8(3), 1994.

[André et al. 87] E. André, G. Bosch, G. Herzog, and T. Rist. Coping with the Intrinsic and the Deictic Uses of Spatial Prepositions. In: K. Jorrand and L. Sgurev (eds.), Artificial Intelligence II: Methodology, Systems, Applications, pp. 375-382. Amsterdam: North-Holland, 1987.

[André et al. 88] E. André, G. Herzog, and T. Rist. On the Simultaneous Interpretation of Real World Image Sequences and their Natural Language Description: The System SOCCER. In: Proc. of the 8th ECAI, pp. 449-454, Munich, 1988.

[Arens et al. 93] Y. Arens, E. Hovy, and S. van Mulken. Structure and Rules in Automated Multimedia Presentation Planning. In: Proc. of the 13th IJCAI, pp. 1253-1259, Chambery, France, 1993.

[Csinger & Booth 94] A. Csinger and K. S. Booth. Reasoning about Video: Knowledge-based Transcription and Presentation. In: J. F. Nunamaker and R. H. Sprague (eds.), HICSS-94, Volume III, Information Systems: Decision Support and Knowledge-based Systems, pp. 599-608, Maui, HI, 1994.

[Feiner & McKeown 93] S. K. Feiner and K. R. McKeown. Automating the Generation of Coordinated Multimedia Explanations. In: M. T. Maybury (ed.), Intelligent Multimedia Interfaces, pp. 117-138. Menlo Park, CA: AAAI Press, 1993.

[Finkler & Schauder 92] W. Finkler and A. Schauder. Effects of Incremental Output on Incremental Natural Language Generation. In: Proc. of the 10th ECAI, pp. 505-507, Vienna, 1992.

[Gapp 94] K.-P. Gapp. Basic Meanings of Spatial Relations: Computation and Evaluation in 3D Space. In: Proc. of AAAI-94, pp. 1393-1398, Seattle, WA, 1994.

[Grice 75] H. P. Grice. Logic and Conversation. In: P. Cole and J. L. Morgan (eds.), Speech Acts, pp. 41-58. London: Academic Press, 1975.

[Harbusch et al. 91] K. Harbusch, W. Finkler, and A. Schauder. Incremental Syntax Generation with Tree Adjoining Grammars. In: W. Brauer and D. Hernandez (eds.), Verteilte Künstliche Intelligenz und kooperatives Arbeiten: 4. Int. GI-Kongreß Wissensbasierte Systeme, pp. 363-374. Berlin, Heidelberg: Springer, 1991.

[Herzog & Wazinski 94] G. Herzog and P. Wazinski. VIsual TRAnslator: Linking Perceptions and Natural Language Descriptions. Artificial Intelligence Review, 8(2/3):175-187, 1994.

[Herzog et al. 89] G. Herzog, C.-K. Sung, E. André, W. Enkelmann, H.-H. Nagel, T. Rist, W. Wahlster, and G. Zimmermann. Incremental Natural Language Description of Dynamic Imagery. In: C. Freksa and W. Brauer (eds.), Wissensbasierte Systeme. 3. Int. GI-Kongreß, pp. 153-162. Berlin, Heidelberg: Springer, 1989.

[Herzog 92a] G. Herzog. Utilizing Interval-Based Event Representations for Incremental High-Level Scene Analysis. In: M. Aurnague, A. Borillo, M. Borillo, and M. Bras (eds.), Proc. of the 4th International Workshop on Semantics of Time, Space, and Movement and Spatio-Temporal Reasoning, pp. 425-435, Chateau de Bonas, France, 1992.

[Herzog 92b] G. Herzog. Visualization Methods for the VITRA Workbench. Memo 53, Universität des Saarlandes, SFB 314 (VITRA), 1992.

[Koller et al. 92] D. Koller, N. Heinze, and H.-H. Nagel. Algorithmic Characterization of Vehicle Trajectories from Image Sequences by Motion Verbs. In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 90-95, Maui, HI, 1992.

[Maybury 91] M. T. Maybury. Planning Multisentential English Text Using Communicative Acts. PhD thesis, Rome Air Development Center, Air Force Systems Command, Griffiss Air Force Base, NY, 1991.

[Maybury 93] M. T. Maybury. Planning Multimedia Explanations Using Communicative Acts. In: M. T. Maybury (ed.), Intelligent Multimedia Interfaces, pp. 60-74. Menlo Park, CA: AAAI Press, 1993.

[Neumann 89] B. Neumann. Natural Language Description of Time-Varying Scenes. In: D. L. Waltz (ed.), Semantic Structures: Advances in Natural Language Processing, pp. 167-207. Hillsdale, NJ: Lawrence Erlbaum, 1989.

[Retz-Schmidt 91] G. Retz-Schmidt. Recognizing Intentions, Interactions, and Causes of Plan Failures. User Modeling and User-Adapted Interaction, 1:173-202, 1991.

[Rohr 94] K. Rohr. Towards Model-based Recognition of Human Movements in Image Sequences. Computer Vision, Graphics, and Image Processing (CVGIP): Image Understanding, 59(1):94-115, 1994.

[Roth et al. 91] S. F. Roth, J. Mattis, and X. Mesnard. Graphics and Natural Language as Components of Automatic Explanation. In: J. W. Sullivan and S. W. Tyler (eds.), Intelligent User Interfaces, pp. 207-239. New York, NY: ACM Press, 1991.

[Stock 91] O. Stock. Natural Language and Exploration of an Information Space: The ALFresco Interactive System. In: Proc. of the 12th IJCAI, pp. 972-978, Sydney, Australia, 1991.

[Sung 88] C.-K. Sung. Extraktion von typischen und komplexen Vorgängen aus einer langen Bildfolge einer Verkehrsszene. In: H. Bunke, O. Kübler, and P. Stucki (eds.), Mustererkennung 1988, pp. 90-96. Berlin, Heidelberg: Springer, 1988.

[Tsotsos 85] J. K. Tsotsos. Knowledge Organization and its Role in Representation and Interpretation for Time-Varying Data: the ALVEN System. Computational Intelligence, 1:16-32, 1985.

[Wahlster et al. 83] W. Wahlster, H. Marburger, A. Jameson, and S. Busemann. Over-answering Yes-No Questions: Extended Responses in a NL Interface to a Vision System. In: Proc. of the 8th IJCAI, pp. 643-646, Karlsruhe, FRG, 1983.

[Wahlster et al. 93] W. Wahlster, E. André, W. Finkler, H.-J. Profitlich, and T. Rist. Plan-Based Integration of Natural Language and Graphics Generation. Artificial Intelligence, 63:387-427, 1993.

[Walter et al. 88] I. Walter, P. C. Lockemann, and H.-H. Nagel. Database Support for Knowledge-Based Image Evaluation. In: P. M. Stocker, W. Kent, and R. Hammersley (eds.), Proc. of the 13th Conf. on Very Large Databases, Brighton, UK, pp. 3-11. Los Altos, CA: Morgan Kaufmann, 1988.
