Annotating Gigapixel Images

Qing Luan∗
USTC†, Hefei, China
[email protected]

Steven M. Drucker
Microsoft Live Labs, Redmond, WA
[email protected]

Johannes Kopf
University of Konstanz, Konstanz, Germany
[email protected]

Ying-Qing Xu
Microsoft Research Asia, Beijing, China
[email protected]

Michael F. Cohen
Microsoft Research, Redmond, WA
[email protected]

ABSTRACT

Panning and zooming interfaces for exploring very large images containing billions of pixels (gigapixel images) have recently appeared on the internet. This paper addresses issues that arise when creating and rendering auditory and textual annotations for such images. In particular, we define a distance metric between each annotation and any view resulting from panning and zooming on the image. The distance then informs the rendering of audio annotations and text labels. We demonstrate the annotation system on a number of panoramic images.

ACM Classification: H5.2 [Information interfaces and presentation]: User Interfaces - Graphical user interfaces.

General terms: Design, Human Factors

Introduction

Images can entertain us and inform us. Enhancing images with visual and audio annotations can provide added value to the images. Supporting the authoring and delivery of annotations in online images has taken many forms, ranging from advertising, to community annotations of images in Flickr, to the multiple types of annotations embedded in applications depicting satellite imagery such as Google Earth and Virtual Earth. Last year we saw the introduction of systems to create and view very large (gigapixel) images [10], along with the introduction of new viewers for such images (e.g., Zoomify and HD View). Much like in the earth browsers, when viewing gigapixel imagery only a tiny fraction of the image data is viewable at any one time. For example, when viewing a 5 gigapixel image on a 1 megapixel screen, only 1/5000th of the data is ever seen. Exploring the imagery is supported by a panning and zooming interface.

In this short paper, we explore many of the issues that arise when annotating and rendering annotations within very large images. Our main contribution is a model to represent the viewer's perceptual distance from the objects being annotated in the scene while continuously panning and zooming. This model informs an annotation rendering system for when and how to render both auditory and visual annotations. We demonstrate the annotation system within HD View¹, an internet-browser-hosted viewer based on the ideas in [10].

∗ This work was done while Qing Luan was a visiting student at Microsoft Research and Microsoft Research Asia.
† USTC: University of Science and Technology of China

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
UIST'08, October 19–22, 2008, Monterey, California, USA.
Copyright 2008 ACM 978-1-59593-975-3/08/10 ... $5.00.

There are many things this paper does not address, including automating the addition of annotations from web-based content, searching annotations within images, or the automated label layout problem. Each of these problems has been addressed in other contexts, and the lessons learned from them can be applied here. Instead, we focus on what is unique about annotations within the continuous panning and zooming environment. Thus, in addition to annotations of gigapixel imagery, our work is perhaps most directly applicable to the earth browsing systems.

Related Work

Our work draws from a number of areas ranging from zoomable UIs, to map labeling, human psychophysics, and virtual and augmented reality systems.

Zoomable UIs (ZUIs) have been popularized by such systems as PAD [12] and PAD++ [2], which introduced the idea of navigating through a potentially infinite 2D space by panning and zooming a small view into that space. They include hiding objects when they are below a certain minimal magnification threshold. In most ZUIs, the threshold is based solely on the screen extent that an object occupies, as opposed to some notion of perceptual distance to an object as we model.

There has been a wide body of literature focusing on label placement on maps, beginning as early as 1962 with Imhof [8] and continuing today [18]. Much of this work is concerned with the automatic placement of labels which minimize overlap while optimizing nearness to named features. A thorough bibliography can be seen at the Map-Labeling Bibliography². We generally do not address the issue of laying out large numbers of labels, as our model implicitly controls the density of labels before layout.

Interactive map systems such as Google Earth and Virtual Earth are perhaps the closest to our work. They provide a panning and zooming interface in which various kinds of annotations appear and disappear. Work in this area has focused on avoiding visual artifacts such as label popping as well as assuring that labels will appear at interactive rates [3]. Some work has been devoted to avoiding visual clutter, which can impair map reading performance [13].

There has been extensive work on adding audio and textual labels to both virtual and augmented reality systems [7, 1, 4, 19]. It is clear that a greater sense of presence can be achieved by adding binaural sound rendering to a virtual environment [11], and there have been systems that focus on the efficient rendering of spatialized audio in complicated virtual environments [17]. Much of this work can be guided by the psychophysics of human sound localization [5].

¹ http://research.microsoft.com/ivm/HDView.htm
² http://www.math.drofnats.edu/riemann.ps


Figure 1: A Seattle panorama and associated very coarse hand-painted depth map.

Likewise, the automatic placement of labels within both virtual reality and augmented reality systems needs to consider such issues as frame-to-frame coherence, label readability, appropriate view selection, and occlusion of objects and other labels in the environment. While our work is related, we emphasize rapid annotation of large images rather than annotation of a true 3D environment, and thus cannot use the more straightforward physical models for audio rendering or label placement.

Annotations and Views

Due to their size, gigapixel images are typically never stored as a single image, but rather are represented as a multi-resolution pyramid of small tiles that are downloaded and assembled on-the-fly by the viewer. We use HD View, which runs within multiple internet browsers. Conceptually, the image consists of a grid of pixels in x × y where, depending on the projection (perspective, cylindrical, spherical) of the underlying image, there is a mapping from x and y to directions in space. Given a virtual camera, HD View renders the appropriate view depending on the specific orientation, pan, and zoom parameters of the virtual camera. For the purposes of the discussion here, we normalize x to lie in [0, 1] while y varies from 0 to ymax depending on the aspect ratio of the image. Optionally, depth values across the image, d(x, y), can be provided. In the annotation system, d represents the log of the scene depth. In our examples, a rough depth map consisting of three to five (log) depth layers is painted by hand and low-pass filtered (see Figure 1). The system is quite forgiving of inaccuracies.
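As a concrete illustration of this setup, the sketch below builds a d(x, y) lookup from a coarse hand-painted depth image. The file name, layer encoding, and blur radius are illustrative assumptions, not values from the paper.

```python
import numpy as np
from PIL import Image
from scipy.ndimage import gaussian_filter

# Load a coarse, hand-painted depth image (hypothetical file name).
# Pixel values are taken to encode a few rough scene-depth layers.
depth_layers = np.asarray(Image.open("seattle_depth.png").convert("F"))

# The paper stores d as the log of scene depth; low-pass filtering
# softens the hand-painted layer boundaries (sigma chosen arbitrarily).
d_map = gaussian_filter(np.log(np.maximum(depth_layers, 1.0)), sigma=25)

h, w = d_map.shape
y_max = h / w  # x is normalized to [0, 1]; y runs from 0 to y_max

def d(x, y):
    """Sample the log-depth map at normalized image coordinates."""
    col = min(int(x * (w - 1)), w - 1)
    row = min(int(y / y_max * (h - 1)), h - 1)
    return float(d_map[row, col])
```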

Gigapixel Annotations

Annotations in gigapixel images reference objects within an image. For example, in a cityscape, an annotation may refer to a region of the city, a building, or a single person on the street that cannot be seen due to its small size when fully zoomed out. Thus, just as a view has a position and an extent defined by the zoom level, so does an annotation. The annotations themselves are specified from within the interactive viewer while panning and zooming. The user simply draws a rectangle in the current view indicating the extent of the object being annotated. The annotation position, (xA, yA), is set as the center of the rectangle. The annotation's "field of view", fA, is set by the size

fA = √((xright − xleft) · (ytop − ybottom))

of the annotation rectangle. Thus an annotation can be said to be located at (xA, yA, fA, dA), where dA = d(xA, yA) (see Figure 2).

Currently, annotations can be one of three types: a text label, an audio loop, or a narrative audio. Many other types can be supported within the same framework, such as hyperlinks, links to images, etc. For audio, the annotations are associated with a .wav file. Text label annotations contain a text string as well as an offset within the rectangle and a possible leader line to guide final rendering.
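A minimal sketch of how such an annotation record might be assembled from the rectangle drawn in the viewer, following the definitions above. The class and function names are ours, and d is a depth lookup like the one in the previous sketch.

```python
import math
from dataclasses import dataclass

@dataclass
class Annotation:
    x: float        # x_A, center of the drawn rectangle
    y: float        # y_A
    fov: float      # f_A, "field of view" of the annotation
    depth: float    # d_A = d(x_A, y_A)
    kind: str       # "label", "audio_loop", or "narrative"
    payload: str    # text string or path to a .wav file

def make_annotation(x_left, x_right, y_bottom, y_top, kind, payload, d):
    """Build an annotation from a rectangle drawn in the current view."""
    x_a = 0.5 * (x_left + x_right)
    y_a = 0.5 * (y_bottom + y_top)
    # f_A = sqrt((x_right - x_left) * (y_top - y_bottom))
    f_a = math.sqrt((x_right - x_left) * (y_top - y_bottom))
    return Annotation(x_a, y_a, f_a, d(x_a, y_a), kind, payload)
```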

Figure 2: Parameters of an annotation and view.

Gigapixel Image View

Given the pan and zoom, the center of the view has some coordinate (xv, yv) and some field of view fv relative to the full image that defines the x and y extents of the view. We say that fv = 1 when the image is fully zoomed out and visible, and fv = 0.5 when zoomed in so that half the width of the full image is within the browser frame, etc. Thus, at any zoom level, fv = xright − xleft of the current view.

Depth of the Viewer

We set the depth of the viewer to be the value of the depth map at the center of the screen, dv = d(xv, yv). As we discuss later, this depth value plays an increasing role as we zoom in to the image. The viewpoint is in reality fixed, but as one zooms into the image there is also a perception of moving closer to the objects in the scene. The word for this perceived motion is vection: the perception of self-motion induced by visual stimuli [6]. Vection has been studied primarily for images in which the camera actually moves forward, inducing motion parallax between near and far objects. In our case, zooming induces outward optical flow but no parallax. The ambiguity between narrowing the field of view (zooming) and dollying into the scene (moving forward) results in similar vection effects. This is supported by a number of studies [15, 14, 16]. A single center value to represent the depth was chosen to avoid on-the-fly "scene analysis" and to keep the viewing experience interactive.

Perceived Field of View

We make one further modification to the specification of the view. As the user rapidly pans and zooms, it is hypothesized that users are more aware of larger objects, and when stopped on a particular view they become more aware of smaller objects. This notion is supported in the perception literature by studies on the changes in the spatial contrast sensitivity function (SCSF: our ability to distinguish the fluctuations of fine sine gratings) between moving and still images [9]. We cannot see as much high frequency content in an image when it is moving across our visual field as when it is still. Conversely, we become more aware of very low frequencies when the image moves.

Both of these perceptual effects are captured by establishing a perceived field of view value, f̂v, that grows with motion and shrinks when still. This is implemented as follows. A field of view multiplier, mf, is initialized at time zero to 1.0, mf(0) = 1. At each time step, this multiplier is increased if the view is changing and decreased if the view is static.

More formally, a variable m(t) is an indicator of motion: m(t) = cf if there has been any panning or zooming motion of the view between time t − 1 and time t, and m(t) = 1/cf if the view is still. cf is a parameter that controls the strength of the maximum and minimum values the multiplier converges to. We have set cf to 1.5, which corresponds roughly to changes in the SCSF for motion of 2 degrees of visual angle per second. Further study could be done to see if this value should vary based on motion speed. Thus, at each time step:

mf(t) = β m(t) + (1 − β) mf(t − 1)

and finally:

f̂v = mf fv

where β controls how fast the motion effect varies. A value of β of approximately one over the frame rate works well, or approximately 0.3. Thus, as mf varies slowly between cf and 1/cf, the effective zoom grows and shrinks accordingly.

Thus, a view is fully specified by its position, perceptual size, and the depth value at its center. This is captured by the tuple (xv, yv, f̂v, dv) (see the blue dotted rectangle in Figure 2).

Figure 3: Three views of an annotated gigapixel image of Seattle (a) and Yosemite (b).

Annotation Strength

Given a gigapixel image, a set of annotations, the current view, and some view history, the annotation rendering system decides which annotations to render (visual or audio), what strength each should have, and where to place the annotation (label position or spatialized stereo). The strength of each annotation is inversely correlated with the distance between the current view and the annotation.

Distance Between an Annotation and a View

To determine the strength with which to render each annotation, we begin by computing four distance values between the view and the annotation:

Xdist = |xA − xv| describes the horizontal offset between the view and each annotation.

Ydist = |yA − yv| describes the vertical offset between the view and each annotation.

Fdist = |fv − fA| / fv if fv > fA (while zooming in to the field of view of the annotation), and Fdist = |fv − fA| / (1 − fv) otherwise (i.e., when we are zooming in beyond the field of view of the annotation). Intuitively, Fdist measures how large the object being annotated is relative to the view, becoming zero when the object would fill the screen.

Ddist = cd |dA − dv| · (1 − fv); thus, as we zoom in (i.e., as fv gets smaller), the difference in depths takes on an increasing role. A narrow field of view invokes a stronger sensation of being at the depth of the object than we have with a wider angle view. cd normalizes the depth difference term and is typically set to 1/(dmax − dmin).

D = √(Xdist² + Ydist² + Fdist² + Ddist²)

Initial Strength of an Annotation

Finally, the initial strength, A, of each annotation drops off with distance:

A = exp(−D/σD)

where σD controls the drop-off of the annotations with distance. We have found a default value of σD = 0.1 to work well. σD, however, is the one parameter it makes sense to put in the user's hands. By varying σD from small values to large, the user can control whether only those annotations in the immediate central view (i.e., with small D values) carry any strength, or, with larger σD, whether all annotations carry more even strengths.

Ambient Annotations

In addition to the standard annotations, there is one additional ambient audio and label annotation. These annotations are global and carry a constant weight, A0, which we currently set to 0.2. The ambient audio annotation provides background audio. The ambient label annotation is typically just a null annotation. The ambient audio volume and the influence of the null text annotation diminish as other annotations gain strength, due to the normalization described next.

Normalization

To maintain an approximate constancy of annotations, we normalize the strength of each annotation relative to the total of the strengths including the ambient term:

Āi = Ai / Σi Ai

This normalization is done separately for the set of audio annotations and the set of text label annotations.

Hysteresis

Finally, we add a hysteresis effect to the strengths associated with each annotation:

A(t) = α+ A(t) + (1 − α+) A(t − 1) for rising strengths,
A(t) = α− A(t) + (1 − α−) A(t − 1) for falling strengths,

so that the final strength of each annotation varies slowly. We set α+ = 0.2 and α− = 0.05. The final strength A is guaranteed to lie in the interval [0, 1].

Rendering the Annotations

Given the strength, A, for each annotation, we are now ready to render the annotations. The panorama is rendered by HD View using DirectX within an internet browser. Text labels are drawn in the overlay plane.
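Before turning to the individual annotation types, the view update and strength pipeline described above (motion-dependent perceived field of view, the four distance terms, exponential drop-off, ambient-aware normalization, and hysteresis) can be collected into one sketch. The constants are the defaults quoted in the text; the function names and the annotation record (fields x, y, fov, depth, as in the earlier sketch) are ours, and we assume the perceived field of view is the fv used in the distance terms.

```python
import math

# Default constants quoted in the text.
C_F = 1.5        # bound on the field-of-view multiplier
BETA = 0.3       # smoothing rate for the motion indicator
SIGMA_D = 0.1    # drop-off of annotation strength with distance
A_AMBIENT = 0.2  # constant weight of the ambient annotation
ALPHA_UP, ALPHA_DOWN = 0.2, 0.05  # hysteresis rates (rising / falling)

def update_perceived_fov(mf_prev, fov, moving):
    """Grow the perceived field of view while panning/zooming, shrink it when still."""
    m = C_F if moving else 1.0 / C_F          # motion indicator m(t)
    mf = BETA * m + (1.0 - BETA) * mf_prev    # m_f(t) = beta*m(t) + (1-beta)*m_f(t-1)
    return mf, mf * fov                       # perceived FOV = m_f * f_v

def distance(view, ann, c_d):
    """D between a view (x_v, y_v, f_v, d_v) and an annotation (x_A, y_A, f_A, d_A)."""
    xv, yv, fv, dv = view
    x_dist = abs(ann.x - xv)
    y_dist = abs(ann.y - yv)
    if fv > ann.fov:        # still zoomed out relative to the annotated object
        f_dist = abs(fv - ann.fov) / fv
    else:                   # zoomed in beyond the annotation's field of view
        f_dist = abs(fv - ann.fov) / (1.0 - fv)
    d_dist = c_d * abs(ann.depth - dv) * (1.0 - fv)
    return math.sqrt(x_dist**2 + y_dist**2 + f_dist**2 + d_dist**2)

def update_strengths(view, annotations, smoothed, c_d):
    """One time step: raw strengths, ambient-aware normalization, hysteresis."""
    raw = [math.exp(-distance(view, a, c_d) / SIGMA_D) for a in annotations]
    total = sum(raw) + A_AMBIENT   # the paper normalizes audio and text sets separately
    for i, s in enumerate(raw):
        s /= total
        alpha = ALPHA_UP if s > smoothed[i] else ALPHA_DOWN
        smoothed[i] = alpha * s + (1.0 - alpha) * smoothed[i]
    return smoothed
```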

Audio Loop Annotations

Audio loop (ambient) annotations are rendered with volume directly correlated with the strength A. We do, however, modulate the left and right channels to provide stereo directionality to the audio. Signed versions of Xdist and Ddist,

Xdistsigned = xA − xv
Ddistsigned = Sign(dA − dv) · cd |dA − dv|,

provide the angle atan(Xdistsigned / Ddistsigned) between the view direction and the annotation center, which determines the relative left and right volumes.
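The paper gives the angle atan(Xdistsigned / Ddistsigned) but not the exact mapping from angle to channel volumes; the sketch below uses a standard constant-power pan as one plausible choice, with atan2 keeping the angle well defined when the depth difference is zero.

```python
import math

def stereo_gains(strength, x_a, x_v, d_a, d_v, c_d):
    """Split an annotation's volume into left/right gains from its signed offsets."""
    x_signed = x_a - x_v             # Xdist_signed
    d_signed = c_d * (d_a - d_v)     # Sign(d_A - d_v) * c_d * |d_A - d_v|
    # Angle between the view direction and the annotation center.
    angle = math.atan2(x_signed, d_signed)
    # Clamp to [-pi/2, pi/2] and apply a constant-power pan
    # (the panning law itself is our assumption, not specified in the paper).
    pan = max(-math.pi / 2, min(math.pi / 2, angle))
    left = strength * math.cos(0.5 * (pan + math.pi / 2))
    right = strength * math.sin(0.5 * (pan + math.pi / 2))
    return left, right
```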

Audio Narrative Annotations

Audio narrative annotations are intended to be played linearly from the start onward. We set two thresholds on the strength. One specifies when a narrative annotation should be triggered to start. When triggered, the narrative begins at full volume. At some lower strength threshold, the narrative begins to fade in volume over 3 seconds until it is inaudible. If the user moves back towards the narrative source while it is still playing, the narrative continues and regains volume. Once it has stopped, however, the narrative will not begin again until some interval (currently set to 20 seconds) has passed. As in the looping audio annotations, the narrative is also modulated in stereo. Finally, if one narrative is playing, no other narrative can be triggered to play.

Text Labels

The appearance and disappearance of text labels are also triggered by thresholds. As with the narrative annotations, text labels are triggered to fade in over one second at a given strength value, and to fade out over one second at a somewhat lower threshold.

There are two trivial ways to set the text size: it can have a fixed screen size, or a fixed size in panorama coordinates. Unfortunately, in the former case, even though the true size does not change, the text will appear to shrink as one zooms in since the context grows around it. In the latter case, the text will be too small to read when zoomed out and will appear to grow and seem enormous when zoomed in. Instead, we compromise between these two cases. More specifically,

TextSize = ctext (γ + (1 − γ) zA/zv)

where our defaults are ctext = 16 point and γ = 0.5. This results in a perceptually more uniform text size even though the text in fact grows as one zooms in. Although there has been some earlier work on determining label size based on object importance [19], we have not seen the notion of dynamic perceptual size constancy during zooming applied before.

Parameter Setting

As all parameters in the system can be set by educated intuition, they required very little trial and error. The parameters had the same values in all examples. However, the ambient and hysteresis parameters are somewhat a matter of taste: smaller values lead to more responsive but jumpier behavior.

Results and Discussion

We have demonstrated a system for annotating very large images viewed within a panning and zooming interface. Figure 3 shows some screen shots of text annotations within large panoramas. For demos of the system, please visit our website³.

³ http://research.microsoft.com/~cohen/GigapixelAnnotations/GigaAnnotations.htm
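A minimal sketch of the label sizing rule above. The defaults ctext = 16 and γ = 0.5 are those quoted in the text; reading zA/zv as the annotation-to-view field-of-view ratio fA/fv, so that labels grow sub-linearly as the user zooms in, is our assumption.

```python
C_TEXT = 16.0   # default label size in points
GAMMA = 0.5     # blend between fixed screen size and fixed scene size

def label_size(ann_fov, view_fov):
    # TextSize = c_text * (gamma + (1 - gamma) * z_A / z_v); the ratio is
    # interpreted here as ann_fov / view_fov (an assumption), so a label
    # grows as the user zooms toward its object, but only at half rate.
    return C_TEXT * (GAMMA + (1.0 - GAMMA) * ann_fov / view_fov)
```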

Our primary contribution is a distance function between the individual annotations and views of the image that guides the rendering of both audio and text annotations. In this short technote we do not report on a formal study; however, we demoed the application at a 2-day event to 5,000 attendees to great enthusiasm. A more formal study would certainly help confirm what we have observed.

REFERENCES

1. Ronald Azuma and Chris Furmanski. Evaluating label placement for augmented reality view management. In Proc. IEEE and ACM Int'l Symp. on Mixed and Augmented Reality, 2003.

2. Benjamin B. Bederson and James D. Hollan. Pad++: A zoomable graphical interface. In Proceedings of ACM Human Factors in Computing Systems Conference Companion.

3. Ken Been, Eli Daiches, and Chee Yap. Dynamic map labeling. Transactions on Visualization and Computer Graphics, 2006.

4. Blaine Bell, Steven Feiner, and Tobias Hollerer. View management for virtual and augmented reality. In Proceedings of the 14th Annual ACM Symposium on User Interface Software and Technology, 2001.

5. J. Blauert. Spatial Hearing: The Psychophysics of Human Sound Localization. 1983.

6. T. R. Carpenter-Smith, R. G. Futamura, and D. E. Parker. Inertial acceleration as a measure of linear vection: An alternative magnitude estimation. Perception and Psychophysics.

7. K. K. P. Chan and R. W. H. Lau. Distributed sound rendering for interactive virtual environments. In International Conference on Multimedia and Expo, 2004.

8. Eduard Imhof. Die Anordnung der Namen in der Karte. Internat. Yearbook of Cartography, pages 93–129, 1962.

9. D. H. Kelly. Motion and vision. II. Stabilized spatio-temporal threshold surface. J. Opt. Soc. Am., 1979.

10. Johannes Kopf, Matt Uyttendaele, Oliver Deussen, and Michael Cohen. Capturing and viewing gigapixel images. ACM Transactions on Graphics, 2007.

11. P. Larsson, D. Vastfjall, and M. Kleiner. Better presence and performance in virtual environments by improved binaural sound rendering. In Proceedings of the AES 22nd Intl. Conf. on Virtual, Synthetic and Entertainment Audio, 2002.

12. Ken Perlin and David Fox. Pad: An alternative approach to the computer interface. Computer Graphics, 27.

13. R. J. Phillips and L. Noyes. An investigation of visual clutter in the topographic base of geological maps. Cartographic J., 19(2):122–131, 1982.

14. H. J. Sun and B. J. Frost. Computation of different optical velocities of looming objects. Nature Neuroscience, 1998.

15. L. Telford, J. Spratley, and B. J. Frost. The role of kinetic depth cues in the production of linear vection in the central visual field. Perception, 1992.

16. L. C. Trutoiu, S. D. Marin, B. J. Mohler, and C. Fennema. Orthographic and perspective projection influences linear vection in large screen virtual environments. Page 145, ACM Press, 2007.

17. N. Tsingos, E. Gallo, and G. Drettakis. Perceptual audio rendering of complex virtual environments. 2004.

18. A. Woodruff, J. Landay, and M. Stonebraker. Constant information density in zoomable interfaces. In Proceedings of the Working Conference on Advanced Visual Interfaces, 1998.

19. Fan Zhang and Hanqiu Sun. Dynamic labeling management in virtual and augmented environments. In Ninth International Conference on Computer Aided Design and Computer Graphics, 2005.